Eloi Moliner, Jaakko Lehtinen and Vesa Välimäki
Companion page for a paper submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023
Rhodes, Greece, May 2023
The article can be downloaded here.
This paper presents CQT-Diff, a data-driven generative audio model that can, once trained, be used for solving various audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model with an architecture that is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency axis represents pitch equivariance as translation equivariance. The proposed method is evaluated with objective and subjective metrics in three varied tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.
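To illustrate the idea behind the CQT preconditioning, here is a minimal sketch showing that a transposition by \(k\) semitones appears as a translation of \(k \cdot B/12\) bins along a logarithmically-spaced frequency axis with \(B\) bins per octave. It uses librosa's CQT and arbitrary test tones purely for illustration; it is not the invertible CQT implementation used in the paper.

```python
import numpy as np
import librosa

# Two sinusoids a major third (4 semitones) apart.
sr, bins_per_octave = 22050, 36
t = np.arange(sr) / sr
c4 = np.sin(2 * np.pi * 261.63 * t)                  # C4
e4 = np.sin(2 * np.pi * 261.63 * 2 ** (4 / 12) * t)  # 4 semitones higher

# On the CQT's log-frequency axis, the transposition shows up as a
# translation of 4 * (bins_per_octave / 12) = 12 bins.
C_c4 = np.abs(librosa.cqt(c4, sr=sr, n_bins=6 * bins_per_octave,
                          bins_per_octave=bins_per_octave))
C_e4 = np.abs(librosa.cqt(e4, sr=sr, n_bins=6 * bins_per_octave,
                          bins_per_octave=bins_per_octave))
shift = 4 * bins_per_octave // 12
# The two printed bin indices should (approximately) coincide.
print(C_c4.mean(axis=1).argmax() + shift, C_e4.mean(axis=1).argmax())
```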
This webpage contains the supplementary material of the paper, including audio examples from the tasks evaluated in the paper and examples of other inverse problems.
CQT-Diff can be sampled unconditionally to generate piano music.
Since the architecture is fully convolutional, CQT-Diff can generate sounds of arbitrary length (limited only by memory constraints).
CQT-Diff can be conditioned for the task of audio bandwidth extension. In this case, the degradation model is a lowpass filter. We explore the versatility of our model in reconstructing the high-frequency content of audio signals degraded with different types of lowpass filters.
Filter specifications: FIR Kaiser window, \(f_c\) = 1 kHz, order = 500
Filter specifications: FIR Kaiser window, \(f_c\) = 3 kHz, order = 500
Note: The Sashimi-Diff and STFT-Diff conditions were excluded from the listening test experiment for length reasons. Here, we include them for completeness.
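For reference, a minimal sketch of how this FIR degradation could be implemented with SciPy. The Kaiser \(\beta\) parameter and the sampling rate are illustrative assumptions; only the window type, cutoff, and order are specified above.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def kaiser_lowpass(x, fs, fc, order=500, beta=8.0):
    """FIR lowpass designed with a Kaiser window; beta is an assumed shape parameter."""
    taps = firwin(order + 1, fc, window=("kaiser", beta), fs=fs)
    return lfilter(taps, [1.0], x)

fs = 22050                   # illustrative sampling rate
x = np.random.randn(4 * fs)  # stand-in for a piano excerpt
y = kaiser_lowpass(x, fs, fc=1000.0)
```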
Our model is not restricted to FIR filters; it can also be applied when the lowpass filter is IIR. We apply a Chebyshev Type I filter of order 6 with a cutoff of \(f_c =\) 1 kHz. The reconstruction is done using reconstruction guidance, without data consistency steps.
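A corresponding sketch for the IIR case, again with SciPy. The passband ripple value is an assumption, since only the filter type, order, and cutoff are given above.

```python
from scipy.signal import cheby1, lfilter

def chebyshev_lowpass(x, fs, fc=1000.0, order=6, ripple_db=0.1):
    """Chebyshev Type I IIR lowpass; ripple_db is an assumed passband ripple."""
    b, a = cheby1(order, ripple_db, fc, btype="low", fs=fs)
    return lfilter(b, a, x)
```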
Original | Lowpassed \(f_c =\) 1 kHz | Bandwidth extended
We experiment on applying CQT-Diff for bandwidth extension after resampling the lowpassed audio. This experiment is meant to demonstrate that CQT-Diff effectively generates the high-frequency content instead of merely "reversing" the lowpass filter frequency response. To this end, we use torchaudio's resampling function, which uses a sinc interpolator, as the degradation. For maximum flexibility, the conditioning is applied using only reconstruction guidance and no data consistency, using the guidance scaling \(\xi^\prime=0.25\).
The bandwidth-reduced audio files are resampled to \(f_s =\) 2 kHz. Your browser might not be able to stream them, so we suggest downloading the audio files and playing them with a local audio player.
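A minimal sketch of this degradation with torchaudio; the file names below are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Load a piano excerpt (placeholder path) and downsample it to 2 kHz with
# torchaudio's sinc-interpolation resampling; this is the observed signal.
x, fs = torchaudio.load("piano_excerpt.wav")
x_lowres = F.resample(x, orig_freq=fs, new_freq=2000)
torchaudio.save("piano_excerpt_2khz.wav", x_lowres, sample_rate=2000)
```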
Original | Resampled \(f_s =\) 2 kHz (Please download the audio file) | Bandwidth extended
We test CQT-Diff in a condition where the low-resolution observations have been downsampled by naive decimation, without any antialiasing measure. Again, we apply only reconstruction guidance with a scaling of \(\xi^\prime=0.25\).
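A sketch of the naive decimation degradation, which simply discards samples and therefore aliases any content above the new Nyquist frequency:

```python
import numpy as np

def decimate_naive(x, factor=11):
    """Keep every `factor`-th sample with no antialiasing filter,
    so content above the new Nyquist frequency aliases."""
    return np.asarray(x)[::factor]
```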
Original | Decimated by a factor of 11 (Please download the audio file) | Bandwidth extended
Using reconstruction guidance, without any further modification, the proposed method is robust to the presence of Gaussian noise in the observations. We experiment in a bandwidth extension scenario where the lowpassed signal has been contaminated with Gaussian noise at different levels of Signal-to-Noise Ratio (SNR). For this experiment, we use FIR filters designed with a Kaiser window, with a cutoff frequency of \(f_c =\) 1 kHz and order 500.
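A sketch of how such noisy observations could be generated, scaling white Gaussian noise to reach a target SNR with respect to the lowpassed signal:

```python
import numpy as np

def add_noise_at_snr(x, snr_db):
    """Add white Gaussian noise so that the result has the requested SNR
    (in dB) with respect to the input signal x."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power) * np.random.randn(*x.shape)
    return x + noise
```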
Original | Lowpassed (\(f_c =\) 1 kHz) | Bandwidth extended
CQT-Diff can be conditioned for the task of audio inpainting. In this case, the degradation model is a linear mask. We explore the ability of our model to reconstruct audio signals with gaps of different lengths.
The gap is always placed in the middle.
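A minimal sketch of the masking degradation with the gap centered in the signal; the gap length in seconds is a free parameter.

```python
import numpy as np

def mask_center_gap(x, fs, gap_seconds):
    """Zero out a gap of `gap_seconds` in the middle of x and return the
    masked signal together with the binary mask (1 = observed, 0 = missing)."""
    mask = np.ones_like(x)
    gap_len = int(gap_seconds * fs)
    start = (len(x) - gap_len) // 2
    mask[start:start + gap_len] = 0.0
    return x * mask, mask
```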
The baseline Catch-A-Waveform (CAW) did not produce competitive results in this comparison. The reason is that, for the particular piano music we tested on, its generated output was noisy and slightly distorted. This is in line with some of the failure cases reported on its official webpage. CAW works by overfitting a pyramidal ensemble of GANs to a short audio example. In our experiment, we used the inpainting tool from the official code repository and trained the model on the remaining seconds of the masked piano signal.
The examples from the GACELA baseline were computed using the code from the official repository and the weights published by the authors. GACELA uses a separately trained model for each gap length, so the evaluated gap lengths are determined by the published GACELA models. The three models were trained on the same MAESTRO dataset that we use for evaluation.
We experiment on applying CQT-Diff for declipping as a non-linear inverse problem. We report here a set of audio examples (including two baselines) in two Signal-to-Distortion Ratio (SDR) conditions, SDR = 1 dB and SDR = 10 dB. CQT-Diff performs better than the baselines in the more challenging condition of SDR = 1 dB.
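As a reference for the degradation, here is a sketch of hard clipping at a symmetric threshold chosen by bisection so that the clipped signal reaches (approximately) a target SDR. The search procedure is an assumption for illustration, not necessarily the one used in the paper.

```python
import numpy as np

def sdr_db(ref, est):
    """Signal-to-distortion ratio (dB) of est with respect to ref."""
    return 10.0 * np.log10(np.sum(ref ** 2) / (np.sum((ref - est) ** 2) + 1e-12))

def clip_at_sdr(x, target_sdr_db, num_steps=50):
    """Hard-clip x at a symmetric threshold found by bisection so that the
    clipped signal reaches roughly the target SDR."""
    lo, hi = 0.0, np.max(np.abs(x))
    for _ in range(num_steps):
        thr = 0.5 * (lo + hi)
        clipped = np.clip(x, -thr, thr)
        if sdr_db(x, clipped) < target_sdr_db:
            lo = thr   # too much distortion: raise the threshold
        else:
            hi = thr   # too little distortion: lower the threshold
    return np.clip(x, -thr, thr), thr
```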
We experiment on applying our method for compressive sensing, where the task is to reconstruct an audio signal when a large percentage of random audio samples have been dropped out. We use reconstruction guidance with \( \xi^\prime =\)0.25. The schedule consists of \(T=\) 35 steps with \(\rho =\) 13.
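A sketch of the random sample dropout used as the degradation; the fraction of kept samples below is illustrative, since the exact percentage is not stated here.

```python
import numpy as np

def drop_random_samples(x, keep_ratio=0.25, seed=0):
    """Zero out a random subset of samples, keeping roughly `keep_ratio`
    of them, and return the degraded signal together with the mask."""
    rng = np.random.default_rng(seed)
    mask = (rng.random(len(x)) < keep_ratio).astype(x.dtype)
    return x * mask, mask
```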
Original | Compressed | Reconstructed
The proposed method has certain limitations that need to be pointed out. First of all, the training data was restricted to a piano music dataset. For a more practical application, the model would need to be trained with large-scale audio data, which would probably require scaling up its capacity. Another limitation is that the inference time is slow, as a consequence of the over-completeness of the rasterized CQT and the use of reconstruction guidance, which requires computing a backpropagation step. Further work is needed to improve the efficiency of the model.
One of the most relevant limitations we observed is that CQT-Diff fails at solving the task of low-frequency bandwidth extension. In this case, the observations are produced by applying a high-pass filter. Although the model is able to generate some low-frequency content, it lacks coherence with the high-frequency observations. We hypothesize that this behaviour is a consequence of the diffusion process design, which is biased to generate the low frequencies earlier than the higher ones. In the experiment, we applied an FIR high-pass filter designed with a Kaiser window, with order 500 and a cutoff at 500 Hz.
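This highpass degradation mirrors the FIR lowpass sketch above, using `pass_zero=False`; the Kaiser \(\beta\) and the sampling rate are again illustrative assumptions.

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs = 22050
taps = firwin(501, 500.0, window=("kaiser", 8.0), pass_zero=False, fs=fs)
x = np.random.randn(4 * fs)          # stand-in for a piano excerpt
x_highpassed = lfilter(taps, [1.0], x)
```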
Original | Highpassed \(f_c =\) 500 Hz | Bandwidth extended