SigLIP

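instead of an InfoNCE (softmax) loss over the whole batch, we apply a binary cross-entropy loss to each image-text pair independently, with each pair labelled positive (matching) or negative (non-matching).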

the logit for a pair is the dot product between the two (L2-normalized) embeddings, times a learnable weight (a temperature) plus a learnable bias. the probability is the sigmoid of that logit.
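
a minimal sketch of the pair probability and the resulting loss, in plain Python. the names `l2_normalize`, `pair_probability`, and `sigmoid_loss` are made up for illustration, and dividing the summed loss by the batch size is my assumption about the normalization, not something stated in these notes.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so the dot product is a cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pair_probability(img, txt, t, b):
    # sigmoid(t * <img, txt> + b): t is the weight (temperature), b the bias.
    dot = sum(a * c for a, c in zip(l2_normalize(img), l2_normalize(txt)))
    logit = t * dot + b
    return 1.0 / (1.0 + math.exp(-logit))

def sigmoid_loss(imgs, txts, t, b):
    # Binary cross-entropy summed over all n*n pairs, divided by n (assumed
    # normalization). Pair (i, j) is positive only when i == j.
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = pair_probability(imgs[i], txts[j], t, b)
            total += -math.log(p) if i == j else -math.log(1.0 - p)
    return total / n
```

for example, `pair_probability([1, 0], [1, 0], 10.0, -10.0)` has logit `10 * 1 - 10 = 0`, so the probability is 0.5. a real implementation would work on logits directly (log-sigmoid) rather than calling `log` on a probability, which can underflow for extreme logits.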

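bias initialization is important, since positives and negatives are heavily imbalanced: in a batch of N pairs there are N positives and N^2 - N negatives. starting from a negative bias makes the model predict "negative" for almost every pair at initialization, which avoids a large initial loss dominated by the negatives.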
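
a rough sketch of why the bias matters, under the simplifying assumption that all dot products are about 0 at initialization, so every logit is just the bias. the function names are hypothetical, and the value -10 is the initial bias I recall from the SigLIP paper, so treat it as an assumption here.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def negatives_per_positive(batch_size):
    # Each image has 1 matching text and batch_size - 1 non-matching texts.
    return batch_size - 1

def initial_loss_per_image(batch_size, bias):
    # Assumes every logit equals `bias` (dot products ~ 0 at init).
    positive_term = softplus(-bias)  # -log(sigmoid(bias)), one positive pair
    negative_terms = negatives_per_positive(batch_size) * softplus(bias)  # -log(1 - sigmoid(bias))
    return positive_term + negative_terms

loss_zero_bias = initial_loss_per_image(1024, 0.0)
loss_negative_bias = initial_loss_per_image(1024, -10.0)
```

with a batch of 1024, a zero bias gives `1024 * log(2)` per image (about 710), almost all of it from the 1023 negatives. with a bias of -10 the negatives contribute almost nothing and the loss is about 10, coming mostly from the single positive.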