FLAM: Frame-Wise Language-Audio Modeling
Frame-wise objective, adjustment to remove spurious correlations like event dependencies and label imbalances.
Previous work
good at retrieval, understanding, and text-conditioned generation. instance level alignments between audio and text (e.g. CLAP) cannot find boundaries of acoustic events - bad for audio content search and event detection frame-level annotations are rare, however SED datasets have a limited vocab, remain small in size, due to human annotation effort.
overall text-data volume limits self-supervised approaches
Contributions
- Framelevel open-vocabulary SED
- bias correction term, unbiased event classifier
- scalable (1M) data augmentation pipeline, with precise event boundaries
- open-set, closed-set SED, better than prior-self supervised approaches
- good retrieval, zero-shot classification
Method
frame-level embeddings, as well as sample-level embedding. frame-level embeddings match with text embeddings.
- frame-level constrastive objective
- logit adjustment techniques to remove spurious correlations
- memory-efficient training strategy
- synthetic data using 10-second audio mixtures - 1 million sample dataset
Dataset
- diverse audio events
- LLM generated captions
- simulation
improves open-vocabulary localization maintains retrieval/downstream task performance
SED
each frame can contain a variable number of events, including none open vocabulary - unlimited number of prompts, probabilities for each frame and event
- classier takes audio and text embedding, and detects whether event occurs.
- Note to self: this is actually kind of like linear classification, where the weights of the last layer are the text embedding.
Current ALMs
- temporal representations can be averaged, from the second-to-last layer of contrastive ALMs
Efficiency
- can precompute audio embeddings, and match it up to different text queries
- can be built on current ALMs
Logit Adjustment
- some classses occur more often than others, some events are longer than others
- most frame/text pairs are super negative.
- thus, there is a text-related logit bias applied to the pre-sigmoid dot product.
- dependencies are bad - what if you hear thunder, but then classifies it as rain, since rain is longer in the dataset.
Experiments
Sound event detection - graph
- dataset: a single example of audio mixture
- metrics: frame-wise prediction accuracy
- baselines:
Sound event detection performance
- dataset: synthetic open-vocabulary SED. 6 datasets
- metrics: AUROC, PSDS
- baselines: MGA-CLAP, and FLAM but global.
retrieval
- text to audio, audio to text.
- 3 datasets
- baselines: FLAM global and MGA-CLAP (retrained on same dataset). also compared to LAION CLAP, CompA, MGA-CLAP, which were trained on different datasets.
zero-shot classification
- baselines: MGA-CLAP, LAION,
Ablations
- removing per-text scale, per-text bias, and both
Last Reviewed: 7/16/2025