Mason Wang

cross attention vs dual independent encoders - cross attention will fit perfectly

dual encoder is as good on the training set, but worse on the test set.

blunt representations = blunt updates. if you push the query representation away from some negative documents, it can end up closer to other negative documents.

Late interaction

cross attention not scalable, but has interaction

we still have independent encodings, but each one is a sequence of small vectors (4 bytes each); then do MAX similarity between the small vectors

sublinear search, can do ANN but maximums can be shared across documents

max sim between each element of query and
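A minimal sketch of the max-similarity scoring above, assuming the ColBERT-style form: dot-product similarity, and a sum over query vectors of each one's best match among the document vectors. The notes do not pin down the similarity function or the aggregation, so both are assumptions, and `dot` and `maxsim_score` are hypothetical names used only for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score (assumed form): for each query vector,
    take its maximum similarity over all document vectors, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token vectors; real encoders produce these independently
# for the query and for each document.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # has an exact match for both query vectors
doc_b = [[0.5, 0.5], [-1.0, 0.0]]              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)  # doc_a scores higher than doc_b
```

The document vectors can be computed offline, since the only query-document interaction is the cheap max-and-sum at the end.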

works best as found in Althammer et al. 2023.

Comparison to dot-product approach:

more data, improves fast.

Compositional updates - gradients flow through things that match (decomposed).

Key idea

weight updates actually HELP other documents, instead of pushing the query closer to other negative examples, since some tokens may be shared with other examples
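A small sketch of why the updates are compositional, again assuming a sum-of-max dot-product score (an assumption, not something the notes state). Under that score, the gradient with respect to the document vectors is nonzero only for the vectors that win a max, so an update touches the matched tokens and leaves the rest alone. `maxsim_doc_grads` is a hypothetical helper written for this illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_doc_grads(query_vecs, doc_vecs):
    """Gradient of sum_i max_j (q_i . d_j) with respect to each d_j.
    Each query vector contributes its gradient only to its best match."""
    dim = len(doc_vecs[0])
    grads = [[0.0] * dim for _ in doc_vecs]
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        best = sims.index(max(sims))      # the document vector that matched
        for k in range(dim):
            grads[best][k] += q[k]        # d(q . d)/dd = q, only for the match
    return grads


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.1, 0.8], [-0.5, -0.5]]

grads = maxsim_doc_grads(query, doc)
for vec, g in zip(doc, grads):
    print(vec, "->", g)  # the unmatched third vector gets a zero gradient
```

Because only matched token vectors are updated, a token that also appears in other documents carries the improvement with it, which is the sharing effect described above.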

ColBERT Comparison

Last Reviewed: 10/26/2025