Contrastive Learning is a self-supervised learning paradigm that teaches a model to distinguish between similar and dissimilar data points. Instead of training a model to map an image to a fixed label (e.g., "Dog"), contrastive learning trains the model to ensure that two different views of the same dog are represented by similar vectors, while a view of a cat is represented by a distant vector.

The Objective: Similarity as Distance

The core goal is to learn an embedding function $f(x)$ that maps raw data into a high-dimensional space where distance correlates with semantic similarity.

If $x_i$ and $x_j$ are similar, the cosine similarity of $f(x_i)$ and $f(x_j)$ should be close to 1. If they are different, it should be close to 0.

InfoNCE and the Contrastive Loss

The most common loss function used is InfoNCE (Information Noise-Contrastive Estimation). It treats the problem as a multi-class classification task where, given an anchor, the model must identify the single positive sample among many negatives:

$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_a, z_p) / \tau)}{\sum_{i=0}^{K} \exp(\text{sim}(z_a, z_i) / \tau)}$

Here, $\tau$ is a temperature parameter that controls the "sharpness" of the distribution. By minimizing this loss, the model effectively "pulls" the positive pair together and "pushes" the negatives away in the latent space.

Applications in Multimodal AI

Contrastive learning is the foundation of models like CLIP (Contrastive Language-Image Pre-training). In CLIP, the model is given a batch of image-text pairs. It uses a text encoder and an image encoder to project both into a shared space, then uses a contrastive loss to ensure the correct text matches the correct image.

This approach is powerful because it allows models to learn from raw, unlabelled data found on the internet (e.g., images with captions) rather than requiring manually annotated datasets. It creates a "universal" understanding that can be applied to zero-shot classification and search.

The Logic of Contrastive Learning

The Objective: Similarity as Distance

InfoNCE and the Contrastive Loss

Applications in Multimodal AI

Frequently Asked Questions

Join the EulerFold community

Recommended Readings