Hearing Anything Anywhere

1Stanford University, 2Sony AI, 3University of Maryland, College Park
CVPR 2024

*Indicates Equal Contribution

Rendered music from our model, trained on 12 room impulse response recordings from a real hallway. Headphones are strongly recommended.

Source audio from: doddlevloggle

Overview

We want to capture and reconstruct the spatial acoustic characteristics of a real room in order to synthesize immersive auditory experiences.

We require only:

  • Roughly 12 monaural room impulse response (RIR) recordings.
  • A rough planar reconstruction of the scene.

We use this information to fit a differentiable acoustic inverse rendering framework (DiffRIR) with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity.
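
The fitting procedure above can be sketched with a toy example. Everything here is a minimal, hypothetical stand-in for the actual DiffRIR model: we parameterize an RIR by a single learnable surface-reflectivity coefficient and fit it to a "measured" RIR by gradient descent, which is the basic shape of differentiable inverse rendering.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_rir(reflectivity, direct_delay=40, echo_delay=120, length=400):
    """Toy parametric RIR: a unit direct-path impulse plus one surface
    reflection scaled by a learnable reflectivity coefficient."""
    rir = np.zeros(length)
    rir[direct_delay] = 1.0
    rir[echo_delay] = reflectivity
    return rir

# "Measured" RIR with ground-truth reflectivity 0.6 plus recording noise.
target = render_rir(0.6) + 0.005 * rng.standard_normal(400)

# Gradient descent on the L2 loss; the gradient is analytic here because
# the toy model is linear in the reflectivity parameter.
r = 0.0
lr = 0.5
for _ in range(100):
    residual = render_rir(r) - target
    grad = residual[120]  # d(loss)/dr: only the echo sample depends on r
    r -= lr * grad

print(round(r, 2))  # → 0.6
```

The real framework fits many more interpretable parameters (source directivity, per-surface reflectivity and residual terms) jointly via automatic differentiation, but the optimization loop has this same structure.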

After training, DiffRIR can recover the fully immersive acoustic field of a room, and:

  • Render monaural and binaural RIRs at new listener locations.
  • Render monaural and binaural music at new listener locations.
  • Render realistic trajectories simulating the sonic experience of moving through the room.
  • Perform zero-shot scene modification like virtual speaker rotation and translation.
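
Rendering music at a new listener location reduces to convolving the dry source audio with the RIR rendered at that location (one RIR per ear for binaural output). The sketch below uses placeholder signals and placeholder RIRs; in practice the RIRs would come from the trained model.

```python
import numpy as np

fs = 16000  # sample rate (Hz), assumed for this sketch

# Placeholder dry source signal; in practice, the music track.
t = np.arange(fs) / fs
dry = np.sin(2 * np.pi * 440 * t)

# Placeholder binaural RIR (one impulse response per ear): a direct
# path plus one echo, with a slight level difference between ears.
rir_left = np.zeros(2000)
rir_left[0], rir_left[800] = 1.0, 0.4
rir_right = np.zeros(2000)
rir_right[0], rir_right[900] = 0.9, 0.35

# Spatialized audio = dry signal convolved with each ear's RIR.
left = np.convolve(dry, rir_left)
right = np.convolve(dry, rir_right)
binaural = np.stack([left, right], axis=1)
print(binaural.shape)  # → (17999, 2)
```

Full-length convolution has output length `len(dry) + len(rir) - 1`; for long recordings an FFT-based convolution would be used instead.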

Another Trajectory

Rendered music from our model, trained on 12 room impulse response recordings from the Dampened Room. Headphones are strongly recommended.

Source audio from: doddlevloggle

Video Presentation

Full Video (Demos Included): Hearing Anything Anywhere - CVPR 2024

Virtual Speaker Rotation

Using the learned speaker directivity map from DiffRIR trained on 12 RIRs from a static scene, we can simulate virtual rotation of the speaker. We simulate a helicopter sound playing from the speaker, which is chosen for its broadband transients.
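
One way to picture this: if the learned directivity map gives the speaker's gain as a function of angle, virtually rotating the speaker amounts to querying that map at the listener's angle relative to the new speaker orientation. The sketch below is a hypothetical 1-D (azimuth-only) version with a made-up cardioid-like pattern, not the model's actual directivity representation.

```python
import numpy as np

# Hypothetical learned directivity: gain sampled on a uniform azimuth
# grid, here a simple cardioid-like pattern peaking at azimuth 0.
az_grid = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directivity = 0.5 * (1 + np.cos(az_grid))

def rotated_gain(listener_azimuth, speaker_rotation):
    """Virtually rotating the speaker by `speaker_rotation` radians is
    equivalent to querying the directivity at the relative angle."""
    rel = (listener_azimuth - speaker_rotation) % (2 * np.pi)
    return float(np.interp(rel, az_grid, directivity, period=2 * np.pi))

# A listener at azimuth 0 hears peak gain when the speaker faces them...
print(round(rotated_gain(0.0, 0.0), 2))    # → 1.0
# ...and near-zero gain after a 180-degree virtual rotation.
print(round(rotated_gain(0.0, np.pi), 2))  # → 0.0
```

Broadband transient sources (like the helicopter sound in the demo) make the gain change across angle easy to hear at all frequencies.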

Dataset

The DiffRIR dataset contains real RIRs and music from four rooms: A Classroom, an acoustically Dampened Room, a Hallway, and a Complex Room with many surfaces. In the latter three rooms, we collect additional subdatasets where we vary the location and/or orientation of the speaker, or the presence and location of standalone whiteboard panels in the room. These are used to evaluate zero-shot generalization to changes in room layout. The dataset can be found on Zenodo.

Classroom
Dampened Room
Hallway
Complex Room

BibTeX

@InProceedings{hearinganythinganywhere2024,
    title={Hearing Anything Anywhere},
    author={Mason Wang and Ryosuke Sawata and Samuel Clarke and Ruohan Gao and Shangzhe Wu and Jiajun Wu},
    booktitle={CVPR},
    year={2024}
}