Videos

Lots of video models

Single Frame CNN

Good baseline

Core idea: just train a normal 2D CNN to classify video frames independently!
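A minimal PyTorch-style sketch of this baseline (the ResNet-18 backbone and class name are just illustrative choices): run the same 2D CNN on every frame and average the per-frame class scores.

```python
import torch.nn as nn
import torchvision.models as models

class SingleFrameBaseline(nn.Module):
    """Classify each frame independently with a 2D CNN, then average the scores."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = models.resnet18(num_classes=num_classes)   # any 2D CNN works here

    def forward(self, video):                        # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        frames = video.reshape(B * T, C, H, W)       # treat frames as one big image batch
        scores = self.backbone(frames)               # (B*T, num_classes)
        return scores.reshape(B, T, -1).mean(dim=1)  # average predictions over time
```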

Late Fusion

Intuition: get the high-level appearance of each frame, then combine them

Run 2D CNN on each frame → concatenate features → feed to MLP

With pooling: Run 2D CNN on each frame → average-pool features over time → feed to a linear classifier
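A sketch of the pooling variant, assuming a torchvision ResNet-18 backbone (512-dim features) with its final fc layer removed:

```python
import torch.nn as nn
import torchvision.models as models

class LateFusionPool(nn.Module):
    """Late fusion (pooling variant): per-frame 2D CNN features -> average over time -> linear."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        backbone = models.resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                                 # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        f = self.features(video.reshape(B * T, C, H, W))      # (B*T, feat_dim, 1, 1)
        f = f.reshape(B, T, -1).mean(dim=1)                   # pool features across frames
        return self.classifier(f)
```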

Issue: hard to compare low-level motion between frames

Early Fusion

Compare frames (temporal dimension) in first conv layer, then standard 2D CNN

  • Collapse all temporal info in first conv
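One possible sketch, assuming the frames are simply stacked along the channel dimension so the first conv sees 3T input channels (layer sizes are arbitrary):

```python
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Early fusion: first conv mixes all frames at once, rest is a plain 2D CNN."""
    def __init__(self, num_classes, num_frames):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2, padding=3),  # collapses time
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        x = video.reshape(B, T * C, H, W)          # stack frames along the channel axis
        return self.net(x)
```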

No temporal shift invariance! Separate filters for same motion at different times

Issue: only one layer of temporal processing, which may not be enough

3D CNN

Intuition: 3D versions of convolution + pooling slowly fuse temporal info over the course of the network

Temporal shift invariant! Each filter slides over time

Each layer: a 4D tensor of shape D x T x H x W

  • 3D Conv
  • 3D Pooling
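A tiny sketch (layer sizes and class count are arbitrary): activations are (B, C, T, H, W) tensors, and both conv and pooling slide over time as well as space.

```python
import torch
import torch.nn as nn

# Minimal 3D CNN: 3x3x3 convs, with pooling that gradually shrinks T, H, and W.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # spatial-only pooling: keep time for now
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                  # 2x2x2: halves T, H, and W
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(128, 10),                           # 10 classes, purely illustrative
)

clip = torch.randn(2, 3, 16, 112, 112)            # batch of 16-frame clips
print(model(clip).shape)                          # torch.Size([2, 10])
```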

C3D: the VGG of 3D CNNs; uses 3x3x3 convs and 2x2x2 pooling throughout (except Pool1, which is 1x2x2)

Still an issue: 3x3x3 conv is very expensive

Idea: optical flow

Measure motion

Optical flow highlights local motion

  • Where each pixel will move in the next frame
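For reference, OpenCV's Farneback method computes dense optical flow between two frames; the file names below are placeholders.

```python
import cv2
import numpy as np

# Dense optical flow between two consecutive frames ("frame0.png"/"frame1.png" are placeholders).
prev = cv2.cvtColor(cv2.imread("frame0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame1.png"), cv2.COLOR_BGR2GRAY)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
# flow has shape (H, W, 2): per-pixel (dx, dy) displacement into the next frame
print(flow.shape, np.abs(flow).mean())
```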

Two-Stream Networks

Separating motion and appearance

2 inputs

  • Stack of optical flow
  • Single image
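A rough sketch of the two-stream idea (the backbones and score-sum fusion are illustrative choices, not the exact original recipe): the spatial stream sees one RGB frame, the temporal stream sees a stack of optical flow fields with 2 channels (dx, dy) per frame pair.

```python
import torch.nn as nn
import torchvision.models as models

class TwoStream(nn.Module):
    """Two-stream sketch: appearance from one RGB frame, motion from stacked optical flow."""
    def __init__(self, num_classes, num_flow_frames=10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)
        self.temporal = models.resnet18(num_classes=num_classes)
        # Widen the first conv so it accepts the 2*num_flow_frames flow channels.
        self.temporal.conv1 = nn.Conv2d(2 * num_flow_frames, 64,
                                        kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, rgb_frame, flow_stack):
        # rgb_frame: (B, 3, H, W); flow_stack: (B, 2*num_flow_frames, H, W)
        return self.spatial(rgb_frame) + self.temporal(flow_stack)   # fuse class scores
```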

Modeling long-term temporal structure

  • So far, temporal CNNs only model local motion. What about long-term structure?

Process local features using recurrent network

Can use a multi-layer RNN-style structure to process videos

Recurrent convolutional network

  • In a normal 2D CNN: input → (2D conv) → output features
  • Recurrent CNN: features from layer L, timestep t-1 + features from layer L-1, timestep t → (RNN-like recurrence) → features for layer L, timestep t
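A minimal sketch of one such recurrent-convolutional layer, implementing the recurrence above with two convolutions and a tanh (the exact nonlinearity/gating varies across papers):

```python
import torch
import torch.nn as nn

class ConvRecurrentLayer(nn.Module):
    """One recurrent-CNN layer: h_t = tanh(conv(x_t) + conv(h_{t-1}))."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_x = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)   # layer L-1, timestep t
        self.conv_h = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)  # layer L, timestep t-1

    def forward(self, x_seq):                          # x_seq: (B, T, in_ch, H, W)
        B, T, C, H, W = x_seq.shape
        h = x_seq.new_zeros(B, self.conv_h.in_channels, H, W)
        outputs = []
        for t in range(T):                             # sequential over time: not parallelizable
            h = torch.tanh(self.conv_x(x_seq[:, t]) + self.conv_h(h))
            outputs.append(h)
        return torch.stack(outputs, dim=1)             # (B, T, out_ch, H, W)
```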

Issue: RNNs are slow for long sequences (not parallelizable)

Recall: different ways of processing sequences

  • RNN: for ordered sequences (in video, CNN+RNN)
    • Pros: Good for long sequences
    • Cons: Not parallelizable
  • 1D Convolution: for multidimensional grids (in video: 3D convolution)
    • Pros: highly parallel
    • Cons: Bad at long sequences
  • Self-Attention: for sets of vectors
    • Pros: Good for long sequences, highly parallel
    • Cons: Memory intensive

Spatio-Temporal Self-Attention TODO

Inflating 2D Networks to 3D (I3D)

  • Already lots of work done on images, can we extend to video?
  • Idea: take 2D CNN architecture + replace each 2D conv/pool with 3D version
  • Can use the 2D weights to initialize the 3D conv: copy them across time and divide by the temporal kernel size
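A sketch of this inflation trick for a single conv layer (the helper name is made up, and it assumes integer padding on the 2D conv): repeating the 2D kernel along the new time dimension and dividing by the temporal kernel size means a static "boring video" produces the same activations as the original 2D network.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int) -> nn.Conv3d:
    """Turn a pretrained 2D conv into a 3D conv by copying its weights across time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, kT, kH, kW), scaled so static videos match the 2D net
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```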

Vision Transformers for Video

  • Factorized Attention: split attention into separate spatial and temporal steps (sketch below)
  • Pooling module
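A rough sketch of factorized space-time attention over a (B, T, N, D) token tensor: attend over the N patch tokens within each frame, then over the T timesteps at each patch location (the ordering and block structure vary across models).

```python
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """Factorized attention sketch: spatial attention per frame, then temporal attention per patch."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                          # tokens: (B, T, N, D)
        B, T, N, D = tokens.shape
        x = tokens.reshape(B * T, N, D)                 # attend over the N patches of each frame
        x = self.spatial(x, x, x)[0].reshape(B, T, N, D)
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # attend over T timesteps per patch location
        x = self.temporal(x, x, x)[0]
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)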

Visualizing Video Models

There are more tasks than just classifying short clips

Temporal Action Localization

  • Given long untrimmed video, identify frames corresponding to actions

Spatio-Temporal Detection

  • Given a long untrimmed video, detect people in space and time + classify their activities