Realistic Gramophone Noise Synthesis using a Diffusion Model

Eloi Moliner and Vesa Välimäki

Companion page for a paper presented at the 25th International Conference on Digital Audio Effects (DAFx20in22)
Vienna, Austria, September 2022

The article can be downloaded here.

See the repository on Github

Try a demo on Colab

Abstract

This paper introduces a novel data-driven strategy for synthesizing gramophone noise textures. A diffusion probabilistic model is applied to generate highly realistic quasiperiodic noises. The proposed model is designed to generate samples of length equal to one disk revolution, and a method to generate plausible periodic variations between revolutions is also proposed. In addition, a guided approach is applied as a conditioning method, in which an audio signal generated with manually tuned signal processing is refined via reverse diffusion to sound more realistic. The method is evaluated in a subjective listening test, in which the participants were often unable to distinguish the synthesized signals from the real ones. The synthetic noises produced with the best proposed unconditional method are statistically indistinguishable from real noise recordings. This work shows the potential of diffusion models for highly realistic audio synthesis tasks.


Real Gramophone Examples

Only-noise segments extracted from real gramophone recordings

Unconditional synthesis examples

Fig. Diagram of the unconditional synthesis using a reverse diffusion process
Audio examples synthesized with T = 25, T = 50, and T = 150 diffusion steps
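The number of reverse diffusion steps T trades off sampling speed against quality. As a rough illustration of what such a sampler does, below is a minimal DDPM-style ancestral sampling loop in PyTorch; the linear noise schedule, the noise-prediction network `model(x, t)`, and the sequence length are hypothetical placeholders, and the paper's exact formulation may differ.

```python
import torch

def reverse_diffusion(model, T=50, length=16000, device="cpu"):
    """Sketch of a DDPM-style ancestral sampler (not the paper's exact
    parameterization). `model(x, t)` is assumed to predict the noise
    component eps of x at step t."""
    # Linear beta schedule: a hypothetical choice for illustration.
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise and denoise step by step.
    x = torch.randn(1, length, device=device)
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t], device=device))
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean  # no noise is added at the final step
    return x
```

Fewer steps (e.g., T = 25) make sampling faster but can leave audible artifacts, while more steps (e.g., T = 150) refine the texture at a higher computational cost.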

Guided synthesis examples

Fig. Diagram of the guided synthesis
Guide signals and their refined versions for τ₀ = 0.33, τ₀ = 0.5, and τ₀ = 0.66
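In the guided setting, the hand-crafted guide signal is not synthesized from scratch: it is first diffused forward to an intermediate noise level determined by τ₀, and the reverse process is then run from that point, so larger τ₀ values give the model more freedom to deviate from the guide. The following sketch illustrates this idea in PyTorch under the same hypothetical DDPM-style assumptions as above; the schedule and the `model(x, t)` interface are placeholders, not the paper's exact implementation.

```python
import torch

def guided_synthesis(model, guide, tau0=0.5, T=50, device="cpu"):
    """Sketch of guide refinement: forward-diffuse a hand-crafted guide
    to an intermediate step set by tau0, then run the reverse process
    from there. Details may differ from the paper."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Starting step: a fraction tau0 of the full diffusion trajectory.
    t0 = max(int(tau0 * T) - 1, 0)
    # Forward-diffuse the guide to step t0.
    x = torch.sqrt(alpha_bars[t0]) * guide \
        + torch.sqrt(1.0 - alpha_bars[t0]) * torch.randn_like(guide)
    # Reverse diffusion from t0 back to 0, as in unconditional sampling.
    for t in reversed(range(t0 + 1)):
        eps = model(x, torch.tensor([t], device=device))
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x
```

With τ₀ = 0.33 the output stays close to the guide, whereas τ₀ = 0.66 allows the model to reshape the texture more aggressively.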

Extra: Other audio texture synthesis

To explore the potential of the presented diffusion model for audio texture synthesis, we experimented with retraining the model on other kinds of (noisy) audio textures. We used audio examples from different classes of the AudioSet dataset. Below, we provide some real audio examples of each class, together with some unconditionally generated examples. Note that, for convenience, the model was trained with a reduced, fixed sequence length of 1 s.

Rain

Real examples (10 s) Synthesized examples (1 s)

Applause

Real examples (10 s) Synthesized examples (1 s)