Goal: learn from data without manual label annotation
- Solve “pretext” tasks which produce good features for downstream tasks
Some pretext tasks
- Predict image transformations
- Rotation prediction
- Jigsaw puzzle
- Complete corrupted images
- Image completion
- Colorization
Evaluation
- Don’t actually care about performance of the self-supervised tasks
- Evaluate on downstream target tasks instead
Thus, the steps are
- Learn good feature extractors from self-supervised pretext tasks
- Attach shallow network to feature extractor
- Train this shallow network on target task with small amount of data
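A minimal sketch of this evaluation protocol (often called a linear probe), assuming a pretrained `encoder` module; names like `feature_dim` and `train_step` are placeholders. The encoder is frozen and only the shallow head is trained on the labeled target data.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, feature_dim, num_classes):
    # Freeze the self-supervised feature extractor
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()

    head = nn.Linear(feature_dim, num_classes)          # the shallow network on top
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    def train_step(images, labels):
        with torch.no_grad():                            # features only, no encoder gradients
            feats = encoder(images)
        loss = criterion(head(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    return head, train_step
```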
Pretext Tasks from Image Transformations
Pretext task: Predict Rotations
Idea: a model can recognize the correct rotation of an object only if it has "visual common sense"
Learned as a 4-way classification
- 0 / 90 / 180 / 270
Pretext task: predict relative patch locations. Given the center patch and another patch from its context, predict where in the grid the second patch comes from
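A sketch of the relative-patch-location task under the same placeholder setup: embed both patches with a shared backbone and classify which of the 8 neighboring grid positions the second patch came from.

```python
import torch
import torch.nn as nn

class PatchLocationNet(nn.Module):
    def __init__(self, backbone, feature_dim):
        super().__init__()
        self.backbone = backbone
        # Concatenate the two patch embeddings, then classify among the
        # 8 possible neighbor positions around the center patch.
        self.classifier = nn.Sequential(
            nn.Linear(2 * feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 8),
        )

    def forward(self, center_patch, neighbor_patch):
        f_c = self.backbone(center_patch)
        f_n = self.backbone(neighbor_patch)
        return self.classifier(torch.cat([f_c, f_n], dim=1))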
Pretext task: solve jigsaw puzzles. Given a grid of 9 shuffled patches, predict the correct ordering
Pretext task: image colorization
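A minimal sketch of colorization as a pretext task, assuming Lab-space inputs: predict the two color (ab) channels from the grayscale (L) channel. Real systems often phrase this as classification over quantized color bins; plain regression is shown here for brevity.

```python
import torch.nn as nn

class ColorizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny fully-convolutional placeholder: grayscale (L) in, 2 color (ab) channels out
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),
        )

    def forward(self, gray):                 # gray: (N, 1, H, W)
        return self.net(gray)                # predicted ab channels: (N, 2, H, W)

# loss = nn.MSELoss()(model(L_channel), ab_channels)   # simple regression variant
```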
Pretext task: video colorization
Idea: model the temporal coherence of colors in videos (e.g. a car continues to be red across frames)
Hypothesis: learning to colorize video frames should allow the model to track regions or objects without labels
Learning objective: establish mappings between reference and target frames in a learned feature space
Learning
- Attention map on the reference frame: $A_{ij} = \frac{\exp(f_i^\top f_j)}{\sum_k \exp(f_k^\top f_j)}$
- Predicted color: weighted sum of the reference colors, $y_j = \sum_i A_{ij} c_i$
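A sketch of this attention-based color propagation, with illustrative shapes and names: each target-frame location attends over reference-frame locations in feature space, and its color is the attention-weighted sum of reference colors.

```python
import torch.nn.functional as F

def propagate_colors(ref_feats, tgt_feats, ref_colors):
    """
    ref_feats:  (N_ref, D) features of reference-frame locations
    tgt_feats:  (N_tgt, D) features of target-frame locations
    ref_colors: (N_ref, C) colors at the reference locations
    """
    sim = tgt_feats @ ref_feats.t()        # similarity of every target / reference pair
    attn = F.softmax(sim, dim=1)           # attention map over the reference frame
    return attn @ ref_colors               # weighted sum of reference colors -> (N_tgt, C)
```

Swapping `ref_colors` for segmentation masks or keypoint heatmaps gives the tracking applications below.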
Colorization → tracking
- Use the learned attention to
- Propagate segmentation masks
- Propagate pose keypoints
Key points
- Pretext tasks focus on “visual common sense”
- Models forced to learn good features
- Don’t care about performance on pretext tasks
- Problems
- Coming up with individual pretext tasks is tedious
- Learnings may not be general
- More general pretext task?
- Contrastive representation learning
Contrastive Representation Learning
Key idea: $s(f(x), f(x^+)) \gg s(f(x), f(x^-))$
- $x$: reference sample
- $x^+$: positive sample
- e.g. obtained by transforming $x$ (data augmentation / pretext-task transformations)
- $x^-$: negative sample
Given a score function $s(\cdot, \cdot)$
- Learn an encoder $f$ that yields high scores for positive pairs $(x, x^+)$ and low scores for negative pairs $(x, x^-)$
Loss function (given 1 positive sample and $N-1$ negative samples):
$L = -\mathbb{E}\left[\log \frac{\exp(s(f(x), f(x^+)))}{\exp(s(f(x), f(x^+))) + \sum_{j=1}^{N-1} \exp(s(f(x), f(x_j^-)))}\right]$
- Known as the InfoNCE loss
- Lower bound on the mutual information between $f(x)$ and $f(x^+)$: $MI[f(x), f(x^+)] \ge \log(N) - L$
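A minimal sketch of the InfoNCE loss, assuming the scores have already been computed (the names `score_pos` / `score_neg` are placeholders): it is just an N-way cross-entropy where the positive is the correct class.

```python
import torch
import torch.nn.functional as F

def info_nce(score_pos, score_neg):
    """
    score_pos: (B,)     s(f(x), f(x^+)) for each reference in the batch
    score_neg: (B, N-1) s(f(x), f(x_j^-)) for the N-1 negatives of each reference
    """
    logits = torch.cat([score_pos.unsqueeze(1), score_neg], dim=1)                  # (B, N)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)  # positive = class 0
    # N-way classification: pick the positive among the N candidates
    return F.cross_entropy(logits, targets)
```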
SimCLR: Basic framework for Contrastive Learning
- Use cosine similarity as the score function
- Use projection network
- Project features to a space where contrastive learning is applied
- Positive samples from data aug
- Cropping, color distortion, blur
- Mini-batch training
- To use for downstream applications
- Train the feature encoder on a large dataset (e.g. ImageNet) using SimCLR
- Then freeze the feature encoder + train a linear classifier (or other head) on top with labeled data
- Design choices for SimCLR
- Projection head: linear / non-linear projection heads improve representation learning
- Why?
- Maybe:
- The contrastive objective trains the representation space to be invariant to data transformations, which can discard useful information
- A projection head may let more information be preserved in the representation space (before the head)
- Large batch size: Crucial!
Pseudocode (a runnable sketch follows below)
For a given minibatch of N samples
- For all instances
- Generate positive pairs by sampling two data augmentation functions
- Iterate through and use each of the 2N samples as the reference, compute the average loss
- InfoNCE loss
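A rough, runnable rendering of this procedure, assuming placeholder `encoder`, `projection_head`, and stochastic `augment` modules, and a temperature `tau` for the cosine-similarity scores.

```python
import torch
import torch.nn.functional as F

def simclr_step(encoder, projection_head, augment, images, tau=0.5):
    # Two correlated views of each image -> 2N samples, N positive pairs
    x1, x2 = augment(images), augment(images)
    z = projection_head(encoder(torch.cat([x1, x2], dim=0)))   # (2N, D)
    z = F.normalize(z, dim=1)                                  # unit norm -> dot product = cosine similarity

    sim = z @ z.t() / tau                                      # (2N, 2N) pairwise scores
    # A sample should never count itself as a candidate
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))

    # For sample i, the positive is the other view of the same image
    n = images.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)

    # InfoNCE over the remaining 2N-1 candidates, averaged over all 2N references
    return F.cross_entropy(sim, targets)
```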
Momentum Contrastive Learning (MoCo)
Differences to SimCLR:
- Running queue of negative samples (keys)
- Computes gradients + updates encoder only through the queries
- Decouples mini-batch size from number of keys
- Can have large number of negative samples
- Key encoder slowly progresses using momentum update rules
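A sketch of the two MoCo-specific pieces, with placeholder encoders and a `(K, D)` queue tensor: the momentum (EMA) update of the key encoder, and the fixed-size queue of negative keys that is decoupled from the minibatch size.

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # Key encoder slowly tracks the query encoder (EMA); no gradients flow through it.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)

@torch.no_grad()
def update_queue(queue, new_keys):
    # Enqueue the newest keys, dequeue the oldest; the queue size stays fixed
    # and is independent of the minibatch size.
    # queue: (K, D) running negatives, new_keys: (B, D) keys from this batch.
    return torch.cat([new_keys, queue], dim=0)[: queue.size(0)]
```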
MoCo v2
Hybrid of SimCLR + MoCo
- SimCLR: non-linear projection head + strong data aug
- MoCo: momentum-updated queue which allows training with a large number of negative samples
When comparing the three (SimCLR, MoCo, MoCo v2), some takeaways stand out
- Non-linear projection head and strong data aug are crucial for contrastive learning
- Decoupling the mini-batch size from the negative sample size lets MoCo v2 outperform SimCLR with smaller batches
- and with a smaller memory footprint
Instance vs Sequence Contrastive Learning
Instance-level
- Positive / negative instances
- e.g. SimCLR, MoCo
Sequence level
- Sequential / temporal orders
- e.g. Contrastive Predictive Coding
Contrastive Predictive Coding term by term
- Contrastive: right vs wrong sequences
- Predictive: predicts future patterns given current context
- Coding: learns feature vectors (i.e. codes) for downstream tasks
Steps
- Encode all samples in the sequence into feature vectors $z_t$
- Summarize the context (e.g. $z_{\le t}$) into a context code $c_t$
- Use an auto-regressive model
- Original paper uses GRU-RNN
- Get the loss between the context and a future code using a time-dependent score function $s_k(z_{t+k}, c_t) = z_{t+k}^\top W_k c_t$
- $W_k$: trainable matrix (one per prediction step $k$)
- Loss: InfoNCE over the true future code vs negatives, $\mathcal{L} = -\mathbb{E}\left[\log \frac{\exp(z_{t+k}^\top W_k c_t)}{\sum_j \exp(z_j^\top W_k c_t)}\right]$
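A sketch of CPC's time-dependent score and its InfoNCE loss for a single prediction step k, with placeholder dimensions; the true future code is assumed to sit at index 0 among the candidates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCScore(nn.Module):
    def __init__(self, code_dim, context_dim):
        super().__init__()
        # Trainable matrix W_k for prediction step k (bias-free linear map)
        self.W_k = nn.Linear(context_dim, code_dim, bias=False)

    def forward(self, z_candidates, c_t):
        """
        z_candidates: (N, code_dim)  true future code z_{t+k} at index 0, negatives after
        c_t:          (context_dim,) context code from the autoregressive model (e.g. a GRU)
        Returns the InfoNCE loss for this (context, step-k) pair.
        """
        pred = self.W_k(c_t)                                   # W_k c_t: predicted future code
        logits = z_candidates @ pred                           # scores z^T W_k c_t, shape (N,)
        target = torch.zeros(1, dtype=torch.long, device=logits.device)  # positive at index 0
        return F.cross_entropy(logits.unsqueeze(0), target)
```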
Example use case
- Speaker classification
Takeaways: contrastive representation learning
General formulation: $s(f(x), f(x^+)) \gg s(f(x), f(x^-))$
InfoNCE loss: N-way classification among one positive and $N-1$ negative samples
SimCLR: a simple framework for contrastive representation learning
- Non-linear projection head: flexible learning
- Simple + effective
- Large memory footprint
MoCo: contrastive learning w/ momentum sample encoder
- Decouple negative sample size from minibatch size with queue
MoCo v2
- Combines nonlinear projection head, strong data aug, w/ momentum contrastive learning
CPC: sequence level contrastive learning
- Right vs wrong sequence
- InfoNCE loss w/ time dependent score function
Some other examples to think about
- CLIP: contrastive learning between images + natural language (a sketch of a CLIP-style loss follows below)
- Dense Object Net: contrastive learning on pixel-wise feature descriptors
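A brief sketch of a CLIP-style loss, assuming precomputed image/text features from placeholder encoders: matched pairs sit on the diagonal of the similarity matrix and the loss is a symmetric InfoNCE.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_feats, text_feats, tau=0.07):
    # image_feats, text_feats: (B, D); row i of each comes from the same image/caption pair
    img = F.normalize(image_feats, dim=1)
    txt = F.normalize(text_feats, dim=1)
    logits = img @ txt.t() / tau                                # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=logits.device)   # matches on the diagonal
    # Classify the correct caption for each image and the correct image for each caption
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```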