Early Models

Key question: How do we align language and vision models?

Pick a similarity metric and use it to align the output vectors of:

  • Skip-gram language model
  • Traditional visual model
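
As a minimal sketch of the alignment idea above, cosine similarity can serve as the similarity metric between a word vector and an image vector. The toy vectors, the names `text_vec`, `image_vec`, and `best_match`, and the assumption that both vectors already live in one shared 3-d space are all hypothetical illustrations, not part of the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy outputs of a language model and a visual model,
# assumed to be already projected into the same 3-d space.
text_vec = {"dog": [0.9, 0.1, 0.0], "car": [0.0, 0.2, 0.9]}
image_vec = {"dog_photo": [0.8, 0.2, 0.1], "car_photo": [0.1, 0.1, 0.95]}

def best_match(word):
    """Return the image whose vector is most similar to the word vector."""
    return max(image_vec,
               key=lambda name: cosine_similarity(text_vec[word], image_vec[name]))

print(best_match("dog"))
print(best_match("car"))
```

In a trained system the two vector spaces would be aligned by optimizing this kind of score over paired data; here the vectors are simply chosen by hand.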

Multimodal distributional semantics

  • Bag of visual words:
    • Take picture
    • Use an algorithm (such as SIFT) to find key points
    • Get a feature descriptor for each key point
    • Cluster the feature descriptors with k-means
    • Count how often each cluster (visual word) occurs, i.e. build the bag of words over the descriptors
    • Concatenate the visual word vector with the text word vector
    • Apply SVD to fuse the information
  • Neural version: applying deep learning to this idea
    • Concatenate features from convnets and word embeddings
    • Or try skip-gram prediction onto visual features
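
The bag-of-visual-words steps above can be sketched in plain Python. This is a toy illustration under several assumptions: the descriptors are made-up 2-d points standing in for real SIFT descriptors (which are 128-d), the k-means initialization is a simple farthest-point heuristic chosen for determinism, `text_vector` is a hypothetical word embedding, and the final SVD fusion step is left out.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def init_centroids(points, k):
    """Deterministic init: first point, then repeatedly the farthest point."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(points, k, iterations=10):
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def bag_of_visual_words(descriptors, centroids):
    """Count how many descriptors fall into each cluster (visual word)."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest(d, centroids)] += 1
    return counts

# Toy descriptors for two images (stand-ins for SIFT output).
image_a = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1]]
image_b = [[5.1, 5.0], [4.9, 5.0], [0.0, 0.0]]

vocabulary = kmeans(image_a + image_b, k=2)
hist_a = bag_of_visual_words(image_a, vocabulary)
hist_b = bag_of_visual_words(image_b, vocabulary)

# Concatenate a (hypothetical) text vector with the normalized visual counts.
text_vector = [0.3, 0.7]
fused = text_vector + [c / sum(hist_a) for c in hist_a]
print(hist_a, hist_b, fused)
```

In the full pipeline, the concatenated vectors for many words would be stacked into a matrix and reduced with SVD so that textual and visual dimensions are mixed.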

Sentence level alignment

Image to text: captioning

Attention

Text to image

Features + Fusion

Features

Region features

Multimodal fusion

Early, middle, late
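
A rough sketch of the two ends of this spectrum, assuming the usual reading: early fusion combines the modality features before a single joint model, late fusion combines the outputs of separate per-modality models, and middle fusion mixes them somewhere in between. The linear scorers, the weights, and the equal 0.5 averaging are hypothetical stand-ins for real models.

```python
def linear_score(features, weights):
    """A stand-in for a model: a single weighted sum."""
    return sum(f * w for f, w in zip(features, weights))

# Hypothetical per-modality feature vectors.
text_features = [0.2, 0.8]
image_features = [0.5, 0.1, 0.4]

def early_fusion(text, image, joint_weights):
    """Concatenate features first, then apply one joint model."""
    return linear_score(text + image, joint_weights)

def late_fusion(text, image, text_weights, image_weights):
    """Score each modality separately, then average the scores."""
    text_score = linear_score(text, text_weights)
    image_score = linear_score(image, image_weights)
    return 0.5 * text_score + 0.5 * image_score

print(early_fusion(text_features, image_features, [1, 1, 1, 1, 1]))
print(late_fusion(text_features, image_features, [1, 1], [1, 1, 1]))
```

The practical difference is where cross-modal interaction can happen: the joint model in early fusion sees both modalities at once, while the late-fusion scorers never see each other's input.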

Contrastive Models

CLIP

ALIGN
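
Both CLIP and ALIGN train image and text encoders with a symmetric contrastive objective over a batch of paired examples. The following is a minimal pure-Python sketch of that loss, not either paper's implementation: the embeddings are hand-written toy vectors, the function names are illustrative, and the temperature is fixed at 0.07 as an assumption (in practice it is typically learned).

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and vice versa."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[r][c] = scaled cosine similarity of image r and text c
    logits = [[dot(i, t) / temperature for t in texts] for i in images]
    image_to_text = sum(cross_entropy(logits[r], r) for r in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: matched pairs give a low loss, swapped pairs a high one.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(images, texts))
print(contrastive_loss(images, list(reversed(texts))))
```

Every other item in the batch acts as a negative example, which is why these models benefit from large batch sizes.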

Multimodal Foundation Models

VisualBERT

ViLBERT

LXMERT

Supervised multimodal bitransformers

Pixel-BERT

UNITER

ViLT

Recommended paper: ___

  • specifics don’t matter

FLAVA: holistic approach (one foundation model)

CoCa

Frozen

Flamingo

Perceiver Resampler

Gated XATTN

BLIP/BLIP2

Multimodal chain of thought

KOSMOS-1

Evaluation

COCO

VQA

CLEVR

Hateful memes

Winoground

Beyond Images: other modalities

Other modalities exist beyond images:

  • Audio
  • Video
  • Olfactory embeddings
  • Trimodal (audio, video, text)

Grounded language learning

  • Learning language by interacting with an environment? Perhaps someday in the future

Text to 3D

Where to next?

  • One foundation model will rule them all
    • Parameters will be shared in interesting ways
    • Modality-agnostic foundation models: read and generate multimodally
    • Automatic alignment from unpaired unimodal data will become a big topic
  • Multimodal scaling laws
    • We will investigate tradeoffs
  • Multimodal RAG
    • Query encoder: will be multimodal
    • Document index: will be multimodal
    • Generator: will be multimodal
  • Better evals and benchmarking
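
To make the three multimodal RAG components listed above concrete, here is a toy sketch. Everything in it is hypothetical: the index entries, their hand-written embeddings (assumed to come from a shared multimodal encoder), the dot-product ranking, and the `generate` stub that only formats a string where a real multimodal generator would sit.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical document index mixing modalities in one embedding space.
index = [
    {"id": "caption_1", "modality": "text", "embedding": [0.9, 0.1]},
    {"id": "photo_7", "modality": "image", "embedding": [0.8, 0.3]},
    {"id": "clip_3", "modality": "audio", "embedding": [0.1, 0.9]},
]

def retrieve(query_embedding, index, k=2):
    """Rank indexed items by dot product with the query embedding."""
    ranked = sorted(index,
                    key=lambda doc: dot(query_embedding, doc["embedding"]),
                    reverse=True)
    return ranked[:k]

def generate(query, retrieved):
    """Placeholder for a multimodal generator conditioned on retrieved items."""
    context = ", ".join(f'{d["id"]} ({d["modality"]})' for d in retrieved)
    return f"Answer to '{query}' conditioned on: {context}"

# The query embedding stands in for the output of a multimodal query encoder.
hits = retrieve([1.0, 0.0], index)
print(generate("what is in the picture?", hits))
```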