CLIP

CLIP loss:

numerator = exp(audio dot text)
denominator = sum(exp(audio dot text), for all pairs)
take the log of this, and average across batch
- when averaging across batch, consider examples where you fix the current example’s text and vary the audio, and examples where you fix the current example’s audio compare against different text
typically there is a learnable logit scale, which multiplies the dot products. this is e^(alpha’) where alpha’ is the learnable parameter.

Last Reviewed: 7/15/2025