Mason Wang

FLAM: Frame-Wise Language-Audio Modeling

Frame-wise objective, plus a logit adjustment to remove spurious correlations such as event dependencies and label imbalances.

Previous work

Previous models are good at retrieval, understanding, and text-conditioned generation.

Instance-level alignment between audio and text (e.g. CLAP) cannot find the boundaries of acoustic events, which is bad for audio content search and event detection.

Frame-level annotations are rare; SED datasets have a limited vocabulary and remain small in size due to human annotation effort.

overall text-data volume limits self-supervised approaches

Contributions

Method

The model produces frame-level embeddings as well as a sample-level embedding. The frame-level embeddings are matched with text embeddings.
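The matching of frame-level audio embeddings with text embeddings can be pictured with a small NumPy sketch. Everything here is an assumption for illustration: the shapes `(T, D)` and `(K, D)`, the function name `frame_text_similarity`, and the choice of cosine similarity are not taken from the paper, which may score frames and prompts differently.

```python
import numpy as np

def frame_text_similarity(frame_emb, text_emb):
    """Cosine similarity between every audio frame and every text prompt.

    frame_emb: (T, D) array, one embedding per audio frame (assumed shape).
    text_emb:  (K, D) array, one embedding per text prompt (assumed shape).
    Returns a (T, K) array of similarities in [-1, 1].
    """
    # L2-normalize each row so the dot product becomes a cosine similarity.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return f @ t.T

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))   # 50 frames, 16-dim embeddings (toy values)
prompts = rng.normal(size=(3, 16))   # 3 text prompts (toy values)
sim = frame_text_similarity(frames, prompts)
print(sim.shape)  # (50, 3)
```

Each row of `sim` scores one frame against all prompts, so a column read top to bottom traces how well one prompt matches the audio over time.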

Dataset

Improves open-vocabulary localization while maintaining retrieval and downstream task performance.

SED

Each frame can contain a variable number of events, including none.

Open vocabulary: an unlimited number of prompts, with a probability for each frame and event.
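Because a frame may hold several events or none, each (frame, event) pair needs its own probability rather than a distribution that sums to one across events. A minimal sketch under that reading follows; the independent sigmoid, the 0.5 threshold, and the helper names `frame_event_probs` and `active_events` are illustrative assumptions, not the paper's stated formulation.

```python
import numpy as np

def frame_event_probs(logits):
    """Independent sigmoid per (frame, event) pair; rows need not sum to 1."""
    return 1.0 / (1.0 + np.exp(-logits))

def active_events(probs, threshold=0.5):
    """For each frame, list the indices of events at or above the threshold."""
    return [np.flatnonzero(row >= threshold).tolist() for row in probs]

# Toy logits: 3 frames (rows) x 3 text prompts (columns).
logits = np.array([
    [ 3.0, -2.0,  2.5],   # frame 0: events 0 and 2 overlap
    [-4.0, -3.0, -5.0],   # frame 1: no event
    [-1.0,  4.0, -2.0],   # frame 2: event 1 only
])
probs = frame_event_probs(logits)
print(active_events(probs))  # [[0, 2], [], [1]]
```

Adding another prompt only adds a column, which is what makes the open-vocabulary setting workable: the per-frame decision for existing prompts is unaffected.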

Current ALMs

Efficiency

Logit Adjustment

Experiments

Sound event detection - graph

Sound event detection performance

retrieval

zero-shot classification

Ablations

Last Reviewed: 7/16/2025