2 problems covered here
- Predicting 3D shapes from single images
- Processing 3D input data
Lots more topics in 3D vision
- Multi-view stereo
- Differentiable graphics
- 3D Sensors
- Simultaneous Localization and Mapping (SLAM)
- Etc.
3D shape representations
- Depth map
- Voxel grid
- Point cloud
- Mesh
- Implicit Surfaces
Depth map
Predicting
- Fully convolutional network
- L2 per-pixel loss
Issue: scale / depth ambiguity: from a single image, a small, close object looks identical to a larger, farther object
- Can use a scale-invariant loss (sketch below)
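A minimal PyTorch sketch of a scale-invariant log-depth loss in the spirit of Eigen et al.; the tensor shapes and the `lam` weighting are illustrative assumptions, not from the notes:

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-8):
    # pred, target: (B, H, W) positive depth maps
    d = (pred + eps).log() - (target + eps).log()   # per-pixel log-depth error
    d = d.flatten(1)                                # (B, n) with n = H * W
    n = d.shape[1]
    # first term: ordinary L2 in log space; second term gives credit back
    # for error that is explained by a single global scale factor
    return ((d ** 2).mean(dim=1) - lam * d.sum(dim=1) ** 2 / n ** 2).mean()
```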
Surface normals
- the surface normal gives, for each pixel, a vector normal to the object's surface at the corresponding point in the world (loss sketch below)
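A common per-pixel choice for training normal prediction (assumed here, not stated in the notes) is to maximize the cosine similarity between predicted and ground-truth normals; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def normal_loss(pred, target):
    # pred, target: (B, 3, H, W) per-pixel surface normals
    # penalize the angle between predicted and true normals at every pixel
    return -F.cosine_similarity(pred, target, dim=1).mean()
```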
Voxel grid
Represent shape with a V x V x V grid of occupancies (think: the segmentation mask from Mask R-CNN, but 3D)
Pros: conceptually simple
Cons: need high spatial resolution to capture fine structures, and scaling to high resolutions is non-trivial
Architecture
- Input → 2D features → 3D features → upscaling → 1 x V x V x V
Voxel tubes: final 2D convolutional layer with V filters; interpret the output as a tube of V voxel scores at each spatial location (sketch below)
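A minimal sketch of the voxel-tube idea; the 256-channel input and 1x1 kernel are illustrative assumptions:

```python
import torch
import torch.nn as nn

V = 32
# "voxel tube" head: an ordinary 2D conv whose V output channels are
# read as V depth-wise occupancy scores at each (x, y) location
tube_head = nn.Conv2d(in_channels=256, out_channels=V, kernel_size=1)

feats = torch.randn(1, 256, V, V)   # upsampled 2D feature map
scores = tube_head(feats)           # (1, V, V, V): channel dim doubles as depth
```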
Problems with voxels: huge memory usage (storage grows as V^3)
Scaling Voxels
Oct-trees: use heterogeneous resolutions; add finer voxels only where needed for fine details
Point Cloud
Pros: don't need many points to represent fine structures
Cons: doesn't explicitly represent the surface; would need post-processing to get a mesh
PointNet: process each point independently with a shared MLP, then aggregate with a max-pool so the result is invariant to point ordering (sketch below)
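A minimal sketch of the PointNet idea; layer sizes are illustrative, and the real network adds input/feature transform sub-networks:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # shared MLP applied independently to every point
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, pts):          # pts: (B, N, 3)
        f = self.mlp(pts)            # per-point features (B, N, 256)
        g = f.max(dim=1).values      # symmetric max-pool -> order-invariant (B, 256)
        return self.head(g)          # e.g. shape classification logits
```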
Generating PointCloud outputs
- Input → (2D CNN) image features → (fully connected branch + 2D CNN branch) two sets of points → combined point cloud
Loss function: a differentiable way to compare the point clouds (as sets!)
- Chamfer distance: sum of L2 distances from each point to its nearest neighbor in the other set, summed over both directions (sketch below)
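A minimal PyTorch sketch of Chamfer distance between two point sets (some formulations square the distances; this follows the "sum of L2 distances" phrasing above):

```python
import torch

def chamfer_distance(a, b):
    # a: (N, 3), b: (M, 3) point sets
    d = torch.cdist(a, b)   # (N, M) pairwise L2 distances
    # match each point to its nearest neighbor in the other set, both directions
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()
```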
Triangle mesh
Represent 3D shape with a set of triangles
- Vertices: set of V points in 3D space
- Faces: set of triangles over the vertices
Pros:
- Standard representation for graphics
- Explicitly represents 3D shapes
- Adaptive
- Can attach data to vertices
Pixel2Mesh
- Input: RGB Image
- Output: triangle mesh for object
- Key ideas:
- Iterative refinement: start from an initial ellipsoid mesh and predict offsets for each vertex
- Graph convolution:
- New feature of a vertex depends on its own feature and the features of its neighboring vertices: $f'_i = W_0 f_i + \sum_{j \in N(i)} W_1 f_j$
- The same weights $W_0$, $W_1$ are used at every vertex (sketch below)
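A minimal sketch of this graph convolution, assuming a dense 0/1 adjacency matrix for simplicity:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    # f'_i = W0 f_i + sum over j in N(i) of W1 f_j, same W0 / W1 at every vertex
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim)
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, adj):
        # feats: (V, in_dim) vertex features; adj: (V, V) 0/1 adjacency matrix
        return self.w0(feats) + adj @ self.w1(feats)
```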
- Vertex-aligned features
- For each vertex of the mesh:
- Use camera info to project the vertex onto the image plane
- Use bilinear interpolation to sample a CNN feature at that location (sketch below)
- Similar to the RoI-Align operation from object detection
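A minimal sketch of the bilinear sampling step, assuming the vertices have already been projected and normalized to the [-1, 1] coordinate convention that grid_sample expects:

```python
import torch
import torch.nn.functional as F

def vertex_aligned_features(feat_map, verts_2d):
    # feat_map: (1, C, H, W) CNN features
    # verts_2d: (V, 2) projected vertex coords, normalized to [-1, 1]
    grid = verts_2d.view(1, 1, -1, 2)                  # (1, 1, V, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)
    return sampled.view(feat_map.shape[1], -1).t()     # (V, C): one feature per vertex
```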
- Chamfer loss function
- Convert the predicted mesh to a point cloud by sampling points from its surface, then compute the Chamfer loss
- Also sample points from the surface of the ground-truth mesh (done offline)
Mesh R-CNN
Goal
- Input: single RGB Image
- Output: set of detected objects, each with:
- Bounding box (Mask R-CNN)
- Category label (Mask R-CNN)
- Instance segmentation (Mask R-CNN)
- 3D triangle mesh (mesh head)
Issue with mesh deformation: the topology is fixed by the initial mesh
- Solution: use voxel predictions to create initial mesh prediction
Pipeline
- Input image → 2D object recognition → 3D object voxels → 3D object meshes
Implicit surface
Goal: a function that classifies arbitrary 3D points as inside / outside the shape
- The surface is the level set of this function (e.g., where the occupancy probability equals 1/2)
- Signed distance function (SDF): signed Euclidean distance to the surface of the shape (example below)
Extracting explicit shape outputs (e.g., a mesh) requires post-processing such as marching cubes
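A tiny analytic example of an SDF (a sphere at the origin); in a learned implicit surface, a neural network plays the role of this function:

```python
import torch

def sphere_sdf(x, radius=1.0):
    # signed distance to a sphere at the origin:
    # negative inside, zero on the surface (the level set), positive outside
    return x.norm(dim=-1) - radius

pts = torch.tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
print(sphere_sdf(pts))   # tensor([-1., 1.]): first point inside, second outside
```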
NeRF for view synthesis
View synthesis
- Input: many images of the same scene (with known camera parameters)
- Output: images from novel viewpoints
Volume rendering
Abstract away light sources and objects. For each point in space, we need to know:
- How much light does it emit?
- How opaque?
Each ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:
- Volume density: $\sigma(\mathbf{r}(t))$
- Color in direction $\mathbf{d}$: $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$
Volume rendering equation (color observed by the camera):
- $C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt$, with $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$ (discretized sketch below)
- $t_n$: near point
- $t$: current point
- $t_f$: far point
- $T(t)$: transmittance: how much light from the current point will reach the camera?
- $\sigma(\mathbf{r}(t))$: opacity: how opaque is the current point?
- $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$: what color does the current point emit along the direction towards the camera?
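A minimal sketch of the standard discretized form of this integral (alpha compositing over samples along one ray); treating the final gap as effectively infinite is the usual convention:

```python
import torch

def render_ray(sigma, rgb, t):
    # sigma: (S,) densities, rgb: (S, 3) colors, t: (S,) sample depths along one ray
    delta = t[1:] - t[:-1]                                # gaps between samples
    delta = torch.cat([delta, delta.new_tensor([1e10])])  # last gap ~ infinite
    alpha = 1.0 - torch.exp(-sigma * delta)               # opacity of each segment
    # transmittance T_i: probability that light survives all earlier segments
    T = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = T * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)       # expected color, shape (3,)
```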
NeRF (neural radiance fields)
- Input: position $\mathbf{x}$ and viewing direction $\mathbf{d}$
- Output: color $\mathbf{c}(\mathbf{x}, \mathbf{d})$ and volume density $\sigma(\mathbf{x})$
Architecture
- Fully connected network (sketch below)
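A heavily simplified sketch of this MLP; the real NeRF positionally encodes $\mathbf{x}$ and $\mathbf{d}$ and uses a deeper network with a skip connection, and the sizes here are illustrative:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)    # density depends on position only
        self.rgb = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                 nn.Linear(hidden // 2, 3))  # color also sees direction

    def forward(self, x, d):                 # x, d: (N, 3) positions and view directions
        h = self.trunk(x)
        sigma = torch.relu(self.sigma(h)).squeeze(-1)             # non-negative density
        rgb = torch.sigmoid(self.rgb(torch.cat([h, d], dim=-1)))  # colors in [0, 1]
        return sigma, rgb
```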
Very strong results, but very slow: rendering one image requires querying the MLP at many sample points along every camera ray