SoundCam

Abstract

A room’s acoustic properties are a product of the room’s geometry, the objects within the room, and their specific positions. These properties can be characterized by the room impulse response (RIR) between a source and a listener location, or roughly inferred from recordings of natural signals present in the room. We present SoundCam, the largest dataset of unique RIRs from in-the-wild rooms publicly released to date. It includes 5,000 10-channel real-world measurements of room impulse responses and 2,000 10-channel recordings of music in three different rooms (a controlled acoustic lab, an in-the-wild living room, and a conference room), with different humans in positions throughout each room. We show that these measurements can be used for interesting tasks, such as detecting and identifying humans and tracking their positions.

Rooms

Dark Room

Living Room

Conference Room

In each room, we collect 1000 measurements of the room's acoustic impulse response, while varying the location, presence, and identity of a human in the room. Each impulse response is measured from 10 microphones.

Tasks

The SoundCam dataset can be used to evaluate methods that:

Locate humans using room impulse responses
Identify which human is in a room using room impulse responses
Locate humans while music is playing in the room
Determine if someone is present in the room while music is playing
Generalize localization methods to other individuals
Test robustness of localization methods to changes in room layout
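For the localization tasks above, one natural way to score a method is the Euclidean distance between predicted and ground-truth positions. The sketch below is only an illustration under that assumption; the function name localization_error is hypothetical, and the metric is not necessarily the one reported in the paper. It expects arrays shaped like centroid.npy, that is, [N_datapoints, 2].

```python
import numpy as np


def localization_error(pred_xy, true_xy):
    """Per-datapoint Euclidean distance between predicted and true x,y positions.

    Both inputs are expected to have shape [N_datapoints, 2]. The result is in
    whatever units the positions are stored in.
    """
    pred = np.asarray(pred_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    if pred.shape != true.shape or pred.ndim != 2 or pred.shape[1] != 2:
        raise ValueError("expected two arrays of shape [N_datapoints, 2]")
    return np.linalg.norm(pred - true, axis=1)


# Tiny synthetic example (made-up numbers, not dataset values).
errors = localization_error([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])
mean_error = float(errors.mean())
```

Returning the per-datapoint errors rather than only the mean makes it easy to also report medians or to compare error distributions across rooms or individuals.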

Below, we show some results from our best-performing baseline for localization using a single RIR in the acoustically treated room.

Video

Dataset

The dataset is hosted by the Stanford Data Repository: https://purl.stanford.edu/xq364hd5023

The compressed archives include both raw recordings and preprocessed impulse responses for all the subdatasets used in our experiments. Subdatasets are sorted by room, with some rooms' archives including recordings and data from more than one distinct experiment. 3Dscans.tar.gz includes textured 3D scans of each room, along with 3D scans of each human in the dataset (untextured to preserve anonymity).

Sample Dataset

We provide a small downloadable sample dataset: Download TreatedRoomSmallSet. The files are from the Treated Room and are preprocessed, but the number of data points has been significantly reduced. Information on the data's organization is included below.

Dataset Organization

The preprocessed data will serve most use cases. Its organization is as follows:

Hierarchy

Each subdataset file contains

One folder for each human in the dataset
A folder for the empty room
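Given this layout, the data folders of an extracted subdataset can be enumerated with a few lines of Python. This is a minimal sketch: the helper name list_data_folders and the example path are hypothetical, and the actual folder names depend on the archive you download.

```python
from pathlib import Path


def list_data_folders(subdataset_root):
    """Return the sorted names of the immediate subfolders of a subdataset.

    Each subfolder is expected to hold the data for one human or for the
    empty room. Plain files at the top level are ignored.
    """
    root = Path(subdataset_root)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Usage (hypothetical path to an extracted subdataset):
# print(list_data_folders("path/to/extracted/subdataset"))
```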

Preprocessed Files

Each data folder contains some or all of these files:

audio.npy - the recordings of each sweep, arranged by [N_datapoints, N_Microphones, N_samples]
adjusted_audio.npy - audio.npy, time-adjusted so that the recordings from all datapoints are time-aligned, using the method described in Appendix E.3 of the paper.
centroid.npy - the x,y locations of the human in the room. Shape is [N_datapoints, 2]
deconvolved.npy - the RIRs. Shape is [N_datapoints, N_Microphones, N_samples]
directlines.npy - the sweep signal as measured from a loopback signal, where the output of the audio interface is routed directly into an input. This is used to estimate the delay in the system. The shape is [N_datapoints, N_samples]
skeletons.npy - the poses and joint locations as captured by each of the three cameras. The shape is [N_datapoints, N_Cameras, N_joints, 3]. The indexing of the joints is provided here.
music_audio.npy - the recordings of each music file, arranged by [N_datapoints, N_Microphones, N_samples]
adjusted_music.npy - music_audio.npy, time-adjusted so that the recordings from all datapoints are time-aligned, using the method described in Appendix E.3 of the paper.
music_directlines.npy - the music signal as measured from a loopback signal, where the output of the audio interface is routed directly into an input. This is used to estimate the delay in the system. The shape is [N_datapoints, N_samples]
music_deconvolved.npy - RIRs as measured by deconvolving the music source from the music recording. Shape is [N_datapoints, N_Microphones, N_samples]
music_sources.npy - the source signal of each music file. Shape is [N_datapoints, N_samples]
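The files listed above can be read with NumPy. The sketch below loads whichever of them exist in a data folder and checks that they agree on N_datapoints. It also shows one generic way to estimate a delay between a source signal and its loopback recording, using FFT-based cross-correlation. The helper names (load_folder, same_datapoint_count, estimate_delay) are hypothetical, and the cross-correlation approach is an assumption for illustration; it is not the time-alignment method of Appendix E.3.

```python
from pathlib import Path

import numpy as np

# File stems taken from the list above; a folder may contain only some of them.
FILE_STEMS = (
    "audio", "adjusted_audio", "centroid", "deconvolved", "directlines",
    "skeletons", "music_audio", "adjusted_music", "music_directlines",
    "music_deconvolved", "music_sources",
)


def load_folder(folder, names=FILE_STEMS, mmap_mode=None):
    """Load every <name>.npy that exists in `folder` into a dict of arrays."""
    folder = Path(folder)
    data = {}
    for name in names:
        path = folder / f"{name}.npy"
        if path.exists():
            data[name] = np.load(path, mmap_mode=mmap_mode)
    return data


def same_datapoint_count(data):
    """True if all loaded arrays share the same first axis (N_datapoints)."""
    return len({arr.shape[0] for arr in data.values()}) <= 1


def estimate_delay(recorded, source):
    """Delay of `recorded` relative to `source`, in samples.

    Uses the peak of the linear cross-correlation, computed with FFTs.
    A positive result means `recorded` lags behind `source`.
    """
    recorded = np.asarray(recorded, dtype=float)
    source = np.asarray(source, dtype=float)
    n = len(recorded) + len(source) - 1  # long enough to avoid wrap-around
    spectrum = np.fft.rfft(recorded, n) * np.conj(np.fft.rfft(source, n))
    xcorr = np.fft.irfft(spectrum, n)
    lag = int(np.argmax(xcorr))
    # Indices past the end of `recorded` correspond to negative lags.
    if lag >= len(recorded):
        lag -= n
    return lag
```

For large arrays, passing mmap_mode="r" to load_folder avoids reading whole files into memory. For the music recordings, estimate_delay could be applied to a row of music_directlines.npy and the matching row of music_sources.npy.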

Raw Files

The raw files are provided for completeness. Each folder contains raw recordings from each of the recording channels, as well as the skeletal poses from each camera, and depth maps.

Maintenance

Mason Wang and Samuel Clarke are maintaining the dataset. Mason Wang can be contacted at ycda@stanford.edu, and Samuel Clarke can be contacted at spclarke@stanford.edu.

Please contact us if you notice any errors in the dataset. We will fix any errors we become aware of and update the dataset accordingly. Previous versions of the dataset will be maintained. Errors and previous versions will be posted below.

BibTeX

@inproceedings{wang2023soundcam,
    title={SoundCam: A Dataset for Finding Humans Using Room Acoustics},
    author={Mason Wang and Samuel Clarke and Jui-Hsien Wang and Ruohan Gao and Jiajun Wu},
    booktitle={Advances in Neural Information Processing Systems},
    year={2023}
}

SoundCam: A Dataset for Finding Humans Using Room Acoustics

NeurIPS 2023 Datasets and Benchmarks