Videos

Lots of video models

Single Frame CNN

Good baseline

Core idea: just train a normal 2D CNN to classify video frames independently!
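A minimal PyTorch-style sketch of this baseline (the ResNet-18 backbone and class name are just illustrative choices): run the same 2D CNN on every frame and average the per-frame class scores.

```python
import torch.nn as nn
import torchvision.models as models

class SingleFrameBaseline(nn.Module):
    """Classify each frame independently with a 2D CNN, then average the scores."""
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = models.resnet18(num_classes=num_classes)   # any 2D CNN works here

    def forward(self, video):                        # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        frames = video.reshape(B * T, C, H, W)       # treat frames as one big image batch
        scores = self.backbone(frames)               # (B*T, num_classes)
        return scores.reshape(B, T, -1).mean(dim=1)  # average predictions over time
```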

Late Fusion

Intuition: get the high-level appearance of each frame, then combine them

Run 2D CNN on each frame → concatenate features → feed to MLP

With pooling: Run 2D CNN on each frame → average-pool features over time → feed to a linear classifier
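A sketch of the pooling variant, assuming a torchvision ResNet-18 backbone (512-dim features) with its final fc layer removed:

```python
import torch.nn as nn
import torchvision.models as models

class LateFusionPool(nn.Module):
    """Late fusion (pooling variant): per-frame 2D CNN features -> average over time -> linear."""
    def __init__(self, num_classes, feat_dim=512):
        super().__init__()
        backbone = models.resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                                 # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        f = self.features(video.reshape(B * T, C, H, W))      # (B*T, feat_dim, 1, 1)
        f = f.reshape(B, T, -1).mean(dim=1)                   # pool features across frames
        return self.classifier(f)
```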

Issue: hard to compare low-level motion between frames

Early Fusion

Compare frames (temporal dimension) in first conv layer, then standard 2D CNN

  • Collapse all temporal info in first conv
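One possible sketch, assuming the frames are simply stacked along the channel dimension so the first conv sees 3T input channels (layer sizes are arbitrary):

```python
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Early fusion: first conv mixes all frames at once, rest is a plain 2D CNN."""
    def __init__(self, num_classes, num_frames):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * num_frames, 64, kernel_size=7, stride=2, padding=3),  # collapses time
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, video):                      # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        x = video.reshape(B, T * C, H, W)          # stack frames along the channel axis
        return self.net(x)
```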

No temporal shift invariance! Separate filters for same motion at different times

Issue: only one layer of temporal processing, which may not be enough

3D CNN

Intuition: 3D versions of convolution + pooling slowly fuse temporal info over the course of the network

Temporal shift invariant! Each filter slides over time

Each layer: a 4D tensor of shape D x T x H x W

  • 3D Conv
  • 3D Pooling
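A tiny sketch (layer sizes and class count are arbitrary): activations are (B, C, T, H, W) tensors, and both conv and pooling slide over time as well as space.

```python
import torch
import torch.nn as nn

# Minimal 3D CNN: 3x3x3 convs, with pooling that gradually shrinks T, H, and W.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),          # spatial-only pooling: keep time for now
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),                  # 2x2x2: halves T, H, and W
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(128, 10),                           # 10 classes, purely illustrative
)

clip = torch.randn(2, 3, 16, 112, 112)            # batch of 16-frame clips
print(model(clip).shape)                          # torch.Size([2, 10])
```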

C3D: the VGG of 3D CNNs; uses 3x3x3 convs and 2x2x2 pooling throughout (except Pool1, which is 1x2x2)

Still an issue: 3x3x3 conv is very expensive

Idea: optical flow

Measure motion

Optical flow highlights local motion

  • Where each pixel will move in the next frame
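For reference, OpenCV's Farneback method computes dense optical flow between two frames; the file names below are placeholders.

```python
import cv2
import numpy as np

# Dense optical flow between two consecutive frames ("frame0.png"/"frame1.png" are placeholders).
prev = cv2.cvtColor(cv2.imread("frame0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame1.png"), cv2.COLOR_BGR2GRAY)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
# flow has shape (H, W, 2): per-pixel (dx, dy) displacement into the next frame
print(flow.shape, np.abs(flow).mean())
```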

Two-Stream Networks

Separating motion and appearance

2 inputs

  • Stack of optical flow
  • Single image
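A rough sketch of the two-stream idea (the backbones and score-sum fusion are illustrative choices, not the exact original recipe): the spatial stream sees one RGB frame, the temporal stream sees a stack of optical flow fields with 2 channels (dx, dy) per frame pair.

```python
import torch.nn as nn
import torchvision.models as models

class TwoStream(nn.Module):
    """Two-stream sketch: appearance from one RGB frame, motion from stacked optical flow."""
    def __init__(self, num_classes, num_flow_frames=10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)
        self.temporal = models.resnet18(num_classes=num_classes)
        # Widen the first conv so it accepts the 2*num_flow_frames flow channels.
        self.temporal.conv1 = nn.Conv2d(2 * num_flow_frames, 64,
                                        kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, rgb_frame, flow_stack):
        # rgb_frame: (B, 3, H, W); flow_stack: (B, 2*num_flow_frames, H, W)
        return self.spatial(rgb_frame) + self.temporal(flow_stack)   # fuse class scores
```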

Modeling long-term temporal structure

  • So far, temporal CNNs only model local motion. What about long-term structure?

Process local features using recurrent network

Can use a multi-layer RNN-style structure to process videos

Recurrent convolutional network

  • In a normal 2D CNN: input → (2D conv) → output features
  • Recurrent CNN: features from layer L, timestep t-1 + features from layer L-1, timestep t → (RNN-like recurrence) → features for layer L, timestep t
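A minimal sketch of one such recurrent-convolutional layer, implementing the recurrence above with two convolutions and a tanh (the exact nonlinearity/gating varies across papers):

```python
import torch
import torch.nn as nn

class ConvRecurrentLayer(nn.Module):
    """One recurrent-CNN layer: h_t = tanh(conv(x_t) + conv(h_{t-1}))."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_x = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)   # layer L-1, timestep t
        self.conv_h = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)  # layer L, timestep t-1

    def forward(self, x_seq):                          # x_seq: (B, T, in_ch, H, W)
        B, T, C, H, W = x_seq.shape
        h = x_seq.new_zeros(B, self.conv_h.in_channels, H, W)
        outputs = []
        for t in range(T):                             # sequential over time: not parallelizable
            h = torch.tanh(self.conv_x(x_seq[:, t]) + self.conv_h(h))
            outputs.append(h)
        return torch.stack(outputs, dim=1)             # (B, T, out_ch, H, W)
```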

Issue: RNNs are slow for long sequences (not parallelizable)

Recall: different ways of processing sequences

  • RNN: for ordered sequences (in video, CNN+RNN)
    • Pros: Good for long sequences
    • Cons: Not parallelizable
  • 1D Convolution: for multidimensional grids (in video: 3D convolution)
    • Pros: highly parallel
    • Cons: Bad at long sequences
  • Self-Attention: for sets of vectors
    • Pros: Good for long sequences, highly parallel
    • Cons: Memory intensive

Spatio-Temporal Self-Attention TODO

Inflating 2D Networks to 3D (I3D)

  • Already lots of work done on images, can we extend to video?
  • Idea: take 2D CNN architecture + replace each 2D conv/pool with 3D version
  • Can use the 2D weights to initialize the 3D conv: copy them across time and divide by the temporal kernel size
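A sketch of this inflation trick for a single conv layer (the helper name is made up, and it assumes integer padding on the 2D conv): repeating the 2D kernel along the new time dimension and dividing by the temporal kernel size means a static "boring video" produces the same activations as the original 2D network.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int) -> nn.Conv3d:
    """Turn a pretrained 2D conv into a 3D conv by copying its weights across time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, kT, kH, kW), scaled so static videos match the 2D net
        w = conv2d.weight.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1) / time_kernel
        conv3d.weight.copy_(w)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```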

Vision Transformers for Video

  • Factorized Attention: split attention into separate spatial and temporal steps (sketch below)
  • Pooling module
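A rough sketch of factorized space-time attention over a (B, T, N, D) token tensor: attend over the N patch tokens within each frame, then over the T timesteps at each patch location (the ordering and block structure vary across models).

```python
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    """Factorized attention sketch: spatial attention per frame, then temporal attention per patch."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                          # tokens: (B, T, N, D)
        B, T, N, D = tokens.shape
        x = tokens.reshape(B * T, N, D)                 # attend over the N patches of each frame
        x = self.spatial(x, x, x)[0].reshape(B, T, N, D)
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # attend over T timesteps per patch location
        x = self.temporal(x, x, x)[0]
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)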

Visualizing Video Models

There are more tasks than just classifying short clips

Temporal Action Localization

  • Given long untrimmed video, identify frames corresponding to actions

Spatio-Temporal Detection

  • Given a long untrimmed video, detect people in space and time + classify their activities