Similarity-Based Methods
Take some data, use another network to extract a target, then compare the similarity between the target and the data's representation.
Examples: SimCLR, MoCo, JEPA, BYOL, SwAV, SimSiam, DINO
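A minimal sketch of the shared pattern behind these methods, in PyTorch; this is illustrative, not any one paper's exact recipe (BYOL adds a predictor head, SimSiam uses a stop-gradient instead of an EMA copy, etc.), and all names here are made up.

```python
import copy
import torch
import torch.nn.functional as F

class Encoder(torch.nn.Module):              # stand-in for any backbone
    def __init__(self, dim_in=32, dim_out=16):
        super().__init__()
        self.net = torch.nn.Linear(dim_in, dim_out)
    def forward(self, x):
        return self.net(x)

online = Encoder()
target = copy.deepcopy(online)               # target network extracts the targets
for p in target.parameters():
    p.requires_grad_(False)                  # no gradients flow into the target

def similarity_loss(x1, x2):
    """x1, x2: two augmented views of the same data."""
    z1 = F.normalize(online(x1), dim=-1)     # online representation
    z2 = F.normalize(target(x2), dim=-1)     # target from the other network
    return -(z1 * z2).sum(dim=-1).mean()     # maximize cosine similarity

@torch.no_grad()
def ema_update(m=0.99):                      # target slowly tracks the online net
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)
```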
In depth
Want to organize data
what kind of inductive biases can we design?
Information retrieval
information retrieval - like a library - organized so you can find information faster
corpus of N documents (long pieces of text, PDFs). Goal is search: find the top-k most relevant documents given a query.
k ≪ N
- success@k - 1 if anything relevant appears in the retrieved set
- precision@k - fraction of the top-k retrieved items that are relevant
- recall@k - fraction of all known-relevant items that appear in the top-k (relevant retrieved / total known relevant)
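A small sketch of these metrics, assuming `retrieved` is the system's ranked list of document ids and `relevant` is the set of known-relevant ids (both hypothetical names):

```python
def success_at_k(retrieved, relevant, k):
    """1.0 if anything in the top-k is relevant, else 0.0."""
    return float(any(d in relevant for d in retrieved[:k]))

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all known-relevant items that appear in the top-k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)
```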
must have sub-linear latency in N per query
before 2019 - keyword matching (e.g., TF-IDF/BM25)
Transformers:
N-way regression? Huge computation: too many weights, one per corpus document (see the sketch after this list).
- memory/latency costs due to a huge matrix multiplication
- no connection between documents: if you learn something about document 7, it doesn't transfer. Representations aren't built on top of the content of the documents. In CIFAR you at least have many examples per class.
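A sketch of this naive N-way setup with toy sizes; at real corpus scale (millions of documents) the head matrix dominates memory and compute, and each of its rows is a free parameter with no tie to the document's content:

```python
import torch

N_DOCS, HIDDEN = 10_000, 256             # toy sizes; real corpora are in the millions
query_repr = torch.randn(1, HIDDEN)      # output of some query encoder
head = torch.nn.Linear(HIDDEN, N_DOCS)   # one learned weight vector per document
scores = head(query_repr)                # (1, N_DOCS): one score per document
top_k = scores.topk(10).indices          # "retrieve" by taking the top-k scores
```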
Need a more compositional approach
- feed all docs to the transformer (along with the query?)
- map the document's last token to a score
- O(N^2) attention cost
Decomposition:
- pointwise scoring - score each document separately
- input the query and one document, output a similarity score
- cross attention between query and document
- label 1-2 relevant documents for each query
- train given a single query, single document (decompose further)
- for each q, sample d
- problem: still expensive, O(N) forward passes for each query
- if you have some cheap way to get the top 10*k, you can re-rank those to pick the best k, as long as the initial algorithm has good recall (see the sketch below)
- probably requires good representations; the model will learn to understand the documents
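A sketch of that retrieve-then-re-rank idea; `cheap_retrieve` (the fast, high-recall first stage) and `cross_encoder` (the expensive pointwise scorer, assumed here to return a float) are hypothetical placeholders:

```python
def rerank(query, corpus, cheap_retrieve, cross_encoder, k=10, overshoot=10):
    # Cheap first stage: pull ~10*k candidate ids; only needs good recall.
    candidates = cheap_retrieve(query, corpus, k=overshoot * k)
    # Expensive second stage: O(overshoot * k) cross-encoder calls, not O(N).
    scored = [(cross_encoder(query, corpus[i]), i) for i in candidates]
    scored.sort(reverse=True)                  # best pointwise scores first
    return [i for _, i in scored[:k]]          # the final top-k
```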
Better latency?
- represent the document once, to get a single vector per document.
Dual encoders
- vector similarity
- encode the query Q and document D separately, take a dot product
- similar to classification - since the classification head matrix will have a vector per document
- but now the document representations aren't learned independently; they're computed from the documents' content
- query time - one transformer forward pass (plus N dot products; see the scoring sketch below)
- pointwise training isn't great; hard to get meaningful absolute scores
- contrastive losses - don't match an exact score, just make sure pairs are ordered well
- distance of dissimilar pairs ≫ distance of similar pairs
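A sketch of dual-encoder retrieval at query time, assuming hypothetical `doc_encoder`/`query_encoder` modules that each return a vector; document vectors are computed once, offline:

```python
import torch

@torch.no_grad()
def build_index(docs, doc_encoder):
    # Offline, once per corpus: a single vector per document.
    return torch.stack([doc_encoder(d) for d in docs])   # (N, dim)

@torch.no_grad()
def search(query, index, query_encoder, k=10):
    q = query_encoder(query)           # one transformer forward pass
    scores = index @ q                 # N dot products, one matrix-vector product
    return scores.topk(k).indices      # ids of the top-k documents
```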
- InfoNCE loss - project embeddings onto the hypersphere, then cross-entropy with a softmax over similarities as the classifier
sample negative documents randomly?
- might try to mine for harder negatives
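A sketch of an InfoNCE-style loss with in-batch negatives: normalize embeddings onto the hypersphere, treat each query's paired document as the correct class, and apply cross-entropy over similarities. Hard-negative mining would swap some of these random in-batch rows for top-scoring non-relevant documents.

```python
import torch
import torch.nn.functional as F

def info_nce(q_emb, d_emb, temperature=0.05):
    """q_emb, d_emb: (B, dim); row i of d_emb is the positive for query i."""
    q = F.normalize(q_emb, dim=-1)            # project onto the hypersphere
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0))          # diagonal = the positive pairs
    return F.cross_entropy(logits, labels)    # other rows act as random negatives
```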
Proof
if the dimensions are big enough, and the encoder is big enough, we can approximate any continuous scoring function
dot products are very expressive
Dimensions
can design retrieval problems where, for vector representations with dual encoders to even be expressive, you need a very large dimension (millions)
larger embedding dimensions are better, but the gains are on a log scale
can get zero training loss, but not generalize
Last Reviewed: 10/26/2025