Olli Rummukainen, Jenni Radun, Toni Virtanen, Ville Pulkki

Categorization of natural dynamic audiovisual scenes

Companion page for a paper published in PLoS One, 2014


This work analyzed the perceptual attributes of natural dynamic audiovisual scenes. We presented thirty participants with 19 natural scenes reproduced with an immersive audiovisual display in a similarity categorization task. Natural scene perception has been studied mainly with unimodal settings, which have identified motion as one of the most salient attributes related to visual scenes, and time-frequency domain continuation related to auditory scenes. However, controlled laboratory experiments with natural multimodal stimuli are still scarce. Our results show that humans pay attention to similar perceptual attributes in natural scenes, and a two-dimensional perceptual map of the stimulus scenes and perceptual attributes was obtained in this work. The exploratory results show the amount of motion, perceived noisiness, and eventfulness of the scene to be the most important perceptual attributes in naturalistically reproduced real-world urban environments. We found the scene gist properties openness and expansion to remain as important factors in scenes with no salient auditory or visual events. We propose that the study of scene perception should move forward to understand better the processes behind multimodal scene processing in real-world environments. We publish our stimulus scenes as spherical video recordings and sound field recordings in a publicly available database.

Spherical video and soundfield recording database

Low-resolution previews of the stimulus scenes are presented below. Note that the video projection used in the experiment covered only 226° of the horizontal field of view (FOV) and 57° of the vertical FOV. The previews and the full-resolution video files contain the full spherical video as captured by the Point Grey Research Ladybug 3 camera, so the scenes include some unrelated visual objects, such as the researchers or the microphone, that were not visible in the actual experiment.

The preview videos have a resolution of 1920x960 px and are accompanied by a stereo rendering of the original soundfield recording. Full-resolution spherical videos (5400x2700 px, Huffyuv encoding) and the uncompressed A-format microphone signals can be downloaded from the link below. The username and password are provided in the related publication or can be obtained by contacting the first author.
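To get a rough idea of which part of a spherical frame corresponds to the 226° x 57° region shown in the experiment, the pixel window can be estimated as in the sketch below. It rests on assumptions that are not stated on this page: that the frames use an equirectangular layout (suggested by the 2:1 aspect ratio, with 360° across the width and 180° across the height), and that the displayed region was centered on the middle of the frame. The function name `fov_crop` is hypothetical, and the actual projection geometry used in the experiment may differ.

```python
def fov_crop(width, height, h_fov_deg=226.0, v_fov_deg=57.0):
    """Estimate a centered pixel window covering the given field of view.

    Assumes an equirectangular frame: the full width spans 360 degrees and
    the full height spans 180 degrees. The centering is an assumption, not
    a documented property of the recordings.

    Returns (left, top, crop_width, crop_height) in pixels.
    """
    if not (0 < h_fov_deg <= 360 and 0 < v_fov_deg <= 180):
        raise ValueError("FOV must be within (0, 360] x (0, 180] degrees")

    # Pixels scale linearly with angle in an equirectangular frame.
    crop_w = round(width * h_fov_deg / 360.0)
    crop_h = round(height * v_fov_deg / 180.0)

    # Center the window horizontally and vertically.
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, crop_w, crop_h


if __name__ == "__main__":
    print(fov_crop(1920, 960))    # preview videos: (357, 328, 1205, 304)
    print(fov_crop(5400, 2700))   # full-resolution videos: (1005, 922, 3390, 855)
```

Under these assumptions, roughly 3390x855 px of each 5400x2700 px frame would fall inside the displayed region. A flat crop like this only approximates what participants saw, since it does not account for the geometry of the immersive display.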

Download full-resolution spherical videos and A-format audio files (30.4 GB)








Lecture hall:

Market square:

Narrow space:

Quiet street:

Railway station:

Subway station:



Traffic behind:


Updated on Friday May 16, 2014
The spherical videos and soundfield recordings in the database by Olli Rummukainen, Jenni Radun, Toni Virtanen and Ville Pulkki are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.