wav2vec 2.0

learns speech representations from raw audio alone (self-supervised pretraining), then fine-tunes on transcribed speech

masks the speech input in latent space and solves a contrastive task defined over a quantization of the latent representations; the quantization is learned jointly with the rest of the model

with 1 hour of labeled data, still beats the previous state of the art trained on the 100-hour subset (100x less labeled data)

works with just 10 minutes of labeled data when pretrained on 53k hours of unlabeled audio

CNN, then mask spans of the CNN output, then transformer
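
A rough sketch of the span masking step, to make the note above concrete. The defaults p=0.065 and span=10 are the values I recall from the paper; the function name, the rounding of the number of start indices, and the truncation at the sequence end are my own assumptions. Spans may overlap, so the masked fraction ends up below p * span.

```python
import random


def sample_span_mask(num_steps, p=0.065, span=10, seed=None):
    """Return a list of booleans, True where a latent time step is masked.

    A proportion p of time steps is sampled without replacement as span
    starts; each start masks the next `span` steps (overlaps allowed).
    """
    rng = random.Random(seed)
    # Assumption: round to the nearest integer number of span starts.
    num_starts = int(round(p * num_steps))
    starts = rng.sample(range(num_steps), num_starts)

    mask = [False] * num_steps
    for start in starts:
        # Assumption: spans running past the end are simply truncated.
        for t in range(start, min(start + span, num_steps)):
            mask[t] = True
    return mask
```

In the model, the masked zs are replaced by a shared learned vector before going into the transformer; the quantizer still sees the unmasked zs.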

fine-tuned afterwards on labeled data with a CTC loss
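
CTC trains the model to emit one label (or a blank) per frame, and scores every frame-level sequence that collapses to the transcript. This is not the loss itself, only a sketch of the collapse rule that greedy decoding would use; the function name and the choice of 0 as the blank id are assumptions.

```python
def ctc_collapse(frame_ids, blank=0):
    """Collapse per-frame label ids: merge repeats, then drop blanks."""
    out = []
    prev = None
    for label in frame_ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```

A blank between two identical labels is what lets a repeated character survive the collapse.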

also see vq-wav2vec

CNN: Audio -> Z

transformer: Z -> C

quantizer: Z -> Q

CNN takes raw audio (not a spectrogram); each block is a temporal convolution, then layer norm, then GELU
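
Arithmetic check on how many zs the CNN produces. The kernel widths and strides are the seven-block configuration reported in the paper, with 16 kHz input; the no-padding assumption and the helper names are mine.

```python
# Kernel widths and strides of the 7 conv blocks (as reported in the paper).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz


def num_frames(num_samples):
    """Number of latent vectors z_t for a waveform, assuming no padding."""
    n = num_samples
    for k, s in zip(KERNELS, STRIDES):
        n = (n - k) // s + 1
    return n


def total_stride():
    """Input samples between consecutive latent vectors."""
    stride = 1
    for s in STRIDES:
        stride *= s
    return stride


def receptive_field():
    """Input samples seen by a single latent vector."""
    rf, jump = 1, 1
    for k, s in zip(KERNELS, STRIDES):
        rf += (k - 1) * jump
        jump *= s
    return rf
```

So total_stride() is 320 samples (20 ms), receptive_field() is 400 samples (25 ms), and num_frames(16_000) is 49, i.e. about 49 zs per second of audio.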

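instead of fixed positional embeddings, use a convolutional layer over the sequence of zs, which acts as a relative positional embedding (each position sees a window of neighboring time steps); its output goes through GELU, is added to the inputs, and then layer norm is applied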

use product quantization on the zs: pick one entry from each of several codebooks, concatenate them, then apply a linear projection

differentiate through the quantization layer using Gumbel softmax (hard argmax in the forward pass, straight-through gradient in the backward pass)
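
Forward-pass sketch of the product quantization and Gumbel notes above, in plain Python so it runs without a deep learning library. It only shows the sampling and the codebook lookup; the straight-through gradient needs an autodiff framework and is not shown. Function names, the toy shapes, and the default temperature are assumptions (the paper anneals the temperature during training), and the final linear projection is left out.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Soft one-hot sample: softmax((logit + n) / tau), n = -log(-log(u))."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-10), 1 - 1e-10)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(group_logits, codebooks, tau=2.0, seed=0):
    """Pick one code per group and concatenate the chosen code vectors.

    group_logits: G lists of V logits (computed from z in the real model).
    codebooks:    G lists of V code vectors.
    Returns (chosen indices, concatenated vector before the projection).
    """
    rng = random.Random(seed)
    indices, q = [], []
    for logits, codebook in zip(group_logits, codebooks):
        probs = gumbel_softmax(logits, tau, rng)
        # Forward pass uses the hard argmax of the noisy probabilities.
        idx = max(range(len(probs)), key=probs.__getitem__)
        indices.append(idx)
        q.extend(codebook[idx])
    return indices, q
```

With G groups of V entries each, the quantizer can express V^G distinct codes while only storing G * V vectors.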

training

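mask in the zs; for each masked time step, the training objective is contrastive: from the transformer output c_t, identify the true quantized latent q_t among distractors sampled from other masked time steps of the same utterance (plus a diversity penalty that encourages using all codebook entries)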
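
Sketch of the contrastive term for a single masked time step: cosine similarity between c_t and each candidate, divided by a temperature, then cross-entropy with the true q_t as the target. The function names and the default temperature of 0.1 are assumptions; the diversity penalty is not included.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(c_t, q_true, distractors, kappa=0.1):
    """-log softmax score of the true target among all candidates."""
    candidates = [q_true] + list(distractors)
    sims = [cosine(c_t, q) / kappa for q in candidates]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(sims)
    log_z = top + math.log(sum(math.exp(s - top) for s in sims))
    return log_z - sims[0]
```

The loss is near zero when c_t points at the true q_t and away from the distractors, and grows when a distractor is the closer match.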

Last Reviewed: 10/31/25