Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information

Francesc Lluís, Nils Meyer-Kahlen

Companion page for the paper "Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information," submitted to ICASSP 2025.

Abstract

For audio in augmented reality (AR), knowledge of the user's real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at positions different from those of the sources available for estimation. Here, we propose to use an encoder network trained with a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take this latent representation and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are reflected in the final output.
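The paper's implementation is not reproduced on this page, but the room-encoder objective can be illustrated with a generic InfoNCE-style contrastive loss. In this numpy sketch (the function name, batch layout, and temperature value are illustrative assumptions, not taken from the paper), embeddings of two sounds recorded in the same room form a positive pair, while the other rooms in the batch act as negatives:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: embeddings of sounds from the
    same room (anchor/positive pairs) are pulled together, while
    embeddings from other rooms in the batch act as negatives."""
    # L2-normalise so the dot product becomes cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) similarity matrix
    # the matching pair for each anchor sits on the diagonal
    log_softmax = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

When anchor and positive embeddings agree, the diagonal dominates each softmax row and the loss approaches zero; mismatched pairs yield a high loss, which is the training signal that pushes the encoder to discard scene-specific detail and keep room-specific information.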

Examples: Spatial Analysis

These examples show one channel of the true and generated SRIRs in the time domain, along with directional analysis results provided by the analysis stage of the spatial decomposition method [1]. All plots use the same dynamic range; spatial data is shown only for time instances where the sound pressure is no more than 24 dB below the direct sound. These plots demonstrate that the temporal and spatial structure of the generated responses is similar to that of the ground-truth responses. They also show the interaural coherence of the rendered binaural responses. In the leftmost plots, red squares indicate source positions, black dots ground-truth positions, and blue dots generated positions. The receiver faces the positive x direction.
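The 24 dB gating applied to the spatial plots can be sketched as a simple threshold relative to the direct-sound peak. This is a hypothetical helper, not the actual analysis code:

```python
import numpy as np

def direct_sound_mask(ir, threshold_db=24.0):
    """Boolean mask of samples whose level is within `threshold_db`
    of the direct-sound peak, used to gate the spatial plots."""
    env_db = 20.0 * np.log10(np.abs(ir) + 1e-12)  # sample levels in dB
    peak_db = env_db.max()                        # direct-sound level
    return env_db >= peak_db - threshold_db
```

Samples more than 24 dB below the strongest peak are masked out, so the directional maps only show time instances with significant energy.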

Room 1 In this example, Pos. 1 (the generated SRIR) and Pos. 2 are close to each other. The direct sound directions (approximately 130 and 150 degrees, respectively) and the DRRs are similar. With the high DRR, the direct sound dominates both spatial plots. Examining the temporal structure of the generated response and comparing it to the true response of Pos. 2, one would not be able to identify it as generated. Pos. 3 is further away from the sound source; hence, the DRR is lower and the spatial analysis shows more late energy than for the other two responses.

[Figures: Room 1 — source/receiver layout, time-domain SRIRs with spatial analysis, and interaural coherence]
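The DRR values discussed above can be estimated from a single-channel response as a windowed energy ratio: energy in a short window around the strongest peak versus all later energy. The window length and implementation details here are illustrative assumptions, not the paper's method:

```python
import numpy as np

def drr_db(ir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB: energy in a short window
    around the strongest peak vs. the energy arriving after it."""
    peak = int(np.argmax(np.abs(ir)))
    half = int(direct_window_ms * 1e-3 * fs)      # window half-width in samples
    start, end = max(0, peak - half), peak + half + 1
    direct = np.sum(ir[start:end] ** 2)           # direct-sound energy
    reverb = np.sum(ir[end:] ** 2)                # later (reverberant) energy
    return 10.0 * np.log10(direct / (reverb + 1e-12))
```

A receiver further from the source, as for Pos. 3 above, picks up relatively more reverberant energy and therefore yields a lower DRR.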

Room 2 Here, the response for Pos. 2 was generated.

[Figures: Room 2 — source/receiver layout, time-domain SRIRs with spatial analysis, and interaural coherence]

Room 3 This is a very dry room, which makes the direct sound position very apparent in the maps. The interaural coherence is higher than in the other rooms shown.

[Figures: Room 3 — source/receiver layout, time-domain SRIRs with spatial analysis, and interaural coherence]

Room 4 A more reverberant room.

[Figures: Room 4 — source/receiver layout, time-domain SRIRs with spatial analysis, and interaural coherence]

Room 5 In this example, the response for Pos. 2 was generated. It is located close to Pos. 1; hence, the direct sound for both of them arrives from the rear right, at around 150 degrees.

[Figures: Room 5 — source/receiver layout, time-domain SRIRs with spatial analysis, and interaural coherence]

Listening Examples

Each example presents three renderings. Two use simulated responses from different positions within the same room, whereas the third uses a generated response. Binaural renderings were produced from the four-channel responses using the spatial decomposition method [1].
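The interaural coherence mentioned in the analysis above is often summarised by a broadband IACC, the maximum of the normalised cross-correlation between the two ear signals. This is a minimal sketch of that standard measure, not the paper's implementation:

```python
import numpy as np

def interaural_coherence(left, right):
    """Broadband IACC estimate: maximum of the normalised
    cross-correlation between the left and right ear signals."""
    xcorr = np.correlate(left, right, mode="full")
    norm = np.sqrt(np.sum(left**2) * np.sum(right**2))
    return np.max(np.abs(xcorr)) / (norm + 1e-12)
```

Identical ear signals give a coherence of 1; the drier Room 3 above shows higher coherence than the more reverberant rooms, whose late diffuse energy decorrelates the two ears.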

Please use headphones for the correct binaural experience.

Room | Ground Truth 1 in Room | Ground Truth 2 in Room | Generated
Room 1: [audio players]
Room 2: [audio players]
Room 3: [audio players]
Room 4: [audio players]
Room 5: [audio players]

References

[1] S. Tervo, J. Pätynen, A. Kuusinen, and T. Lokki, "Spatial Decomposition Method for Room Impulse Responses," J. Audio Eng. Soc., vol. 61, no. 1, pp. 17–28, Mar. 2013.