2 problems covered here
- Predicting 3D shapes from single images
- Processing 3D input data
Lots more topics in 3D vision
- Multi-view stereo
- Differentiable graphics
- 3D Sensors
- Simultaneous Localization and Mapping (SLAM)
- Etc.
3D shape representations
- Depth map
- Voxel grid
- Point cloud
- Mesh
- Implicit Surfaces
Depth map
Predicting
- Fully convolutional network
- L2 per-pixel loss
Issue: scale / depth ambiguity: from a single image, a small, close object looks identical to a larger, farther object
- Can use a scale-invariant loss (sketch below)
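A minimal PyTorch sketch of a scale-invariant log-depth loss in the spirit of Eigen et al.; the tensor shapes and the `lam` weighting are illustrative assumptions, not from the notes:

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-8):
    # pred, target: (B, H, W) positive depth maps
    d = (pred + eps).log() - (target + eps).log()   # per-pixel log-depth error
    d = d.flatten(1)                                # (B, n) with n = H * W
    n = d.shape[1]
    # first term: ordinary L2 in log space; second term gives credit back
    # for error that is explained by a single global scale factor
    return ((d ** 2).mean(dim=1) - lam * d.sum(dim=1) ** 2 / n ** 2).mean()
```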
Surface normals
- the surface normal gives, for each pixel, a vector normal to the object's surface at the corresponding point in the world (loss sketch below)
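A common per-pixel choice for training normal prediction (assumed here, not stated in the notes) is to maximize the cosine similarity between predicted and ground-truth normals; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def normal_loss(pred, target):
    # pred, target: (B, 3, H, W) per-pixel surface normals
    # penalize the angle between predicted and true normals at every pixel
    return -F.cosine_similarity(pred, target, dim=1).mean()
```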
Voxel grid
Represent shape with a V x V x V grid of occupancies (think: the segmentation mask from Mask R-CNN, but 3D)
Pros: conceptually simple
Cons: need high spatial resolution to capture fine structures, and scaling to high resolutions is non-trivial
Architecture
- Input → 2D features → 3D features → upscaling → 1 x V x V x V
Voxel tubes: final 2D convolutional layer with V filters; interpret the output as a tube of V voxel scores at each spatial location (sketch below)
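A minimal sketch of the voxel-tube idea; the 256-channel input and 1x1 kernel are illustrative assumptions:

```python
import torch
import torch.nn as nn

V = 32
# "voxel tube" head: an ordinary 2D conv whose V output channels are
# read as V depth-wise occupancy scores at each (x, y) location
tube_head = nn.Conv2d(in_channels=256, out_channels=V, kernel_size=1)

feats = torch.randn(1, 256, V, V)   # upsampled 2D feature map
scores = tube_head(feats)           # (1, V, V, V): channel dim doubles as depth
```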
Problems with voxels: huge memory usage (storage grows as V^3)
Scaling Voxels
Oct-trees: use heterogeneous resolutions; add finer voxels only where needed for fine details
Point Cloud
Pros: don't need many points to represent fine structures
Cons: doesn't explicitly represent the surface; would need post-processing to get a mesh
PointNet: process each point independently with a shared MLP, then aggregate with a max-pool so the result is invariant to point ordering (sketch below)
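A minimal sketch of the PointNet idea; layer sizes are illustrative, and the real network adds input/feature transform sub-networks:

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # shared MLP applied independently to every point
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 256), nn.ReLU())
        self.head = nn.Linear(256, num_classes)

    def forward(self, pts):          # pts: (B, N, 3)
        f = self.mlp(pts)            # per-point features (B, N, 256)
        g = f.max(dim=1).values      # symmetric max-pool -> order-invariant (B, 256)
        return self.head(g)          # e.g. shape classification logits
```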
Generating PointCloud outputs
- Input → (2D CNN) image features → (fully connected branch + 2D CNN branch) two sets of points → combined point cloud
Loss function: a differentiable way to compare the point clouds (as sets!)
- Chamfer distance: sum of L2 distances from each point to its nearest neighbor in the other set, summed over both directions (sketch below)
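A minimal PyTorch sketch of Chamfer distance between two point sets (some formulations square the distances; this follows the "sum of L2 distances" phrasing above):

```python
import torch

def chamfer_distance(a, b):
    # a: (N, 3), b: (M, 3) point sets
    d = torch.cdist(a, b)   # (N, M) pairwise L2 distances
    # match each point to its nearest neighbor in the other set, both directions
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()
```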
Triangle mesh
Represent 3D shape with a set of triangles
- Vertices: set of V points in 3D space
- Faces: set of triangles over the vertices
Pros:
- Standard representation for graphics
- Explicitly represents 3D shapes
- Adaptive
- Can attach data to vertices
Pixel2Mesh
- Input: RGB Image
- Output: triangle mesh for object
- Key ideas:
- Iterative refinement: start from an initial ellipsoid mesh and predict offsets for each vertex
- Graph convolution:
- New feature of a vertex depends on its own feature and the features of its neighboring vertices: $f'_i = W_0 f_i + \sum_{j \in N(i)} W_1 f_j$
- The same weights $W_0$, $W_1$ are used at every vertex (sketch below)
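A minimal sketch of this graph convolution, assuming a dense 0/1 adjacency matrix for simplicity:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    # f'_i = W0 f_i + sum over j in N(i) of W1 f_j, same W0 / W1 at every vertex
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w0 = nn.Linear(in_dim, out_dim)
        self.w1 = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats, adj):
        # feats: (V, in_dim) vertex features; adj: (V, V) 0/1 adjacency matrix
        return self.w0(feats) + adj @ self.w1(feats)
```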
- Vertex-aligned features
- For each vertex of the mesh:
- Use camera info to project the vertex onto the image plane
- Use bilinear interpolation to sample a CNN feature at that location (sketch below)
- Similar to the RoI-Align operation from object detection
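A minimal sketch of the bilinear sampling step, assuming the vertices have already been projected and normalized to the [-1, 1] coordinate convention that grid_sample expects:

```python
import torch
import torch.nn.functional as F

def vertex_aligned_features(feat_map, verts_2d):
    # feat_map: (1, C, H, W) CNN features
    # verts_2d: (V, 2) projected vertex coords, normalized to [-1, 1]
    grid = verts_2d.view(1, 1, -1, 2)                  # (1, 1, V, 2)
    sampled = F.grid_sample(feat_map, grid, align_corners=False)
    return sampled.view(feat_map.shape[1], -1).t()     # (V, C): one feature per vertex
```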
- Chamfer loss function
- Convert the predicted mesh to a point cloud by sampling points from its surface, then compute the Chamfer loss
- Also sample points from the surface of the ground-truth mesh (done offline)
Mesh R-CNN
Goal
- Input: single RGB Image
- Output: set of detected objects, each with:
- Bounding box (Mask R-CNN)
- Category label (Mask R-CNN)
- Instance segmentation (Mask R-CNN)
- 3D triangle mesh (mesh head)
Issue with mesh deformation: the topology is fixed by the initial mesh
- Solution: use voxel predictions to create initial mesh prediction
Pipeline
- Input image → 2D object recognition → 3D object voxels → 3D object meshes
Implicit surface
Goal: a function that classifies arbitrary 3D points as inside / outside the shape
- The surface is the level set of this function (e.g., where the occupancy probability equals 1/2)
- Signed distance function (SDF): signed Euclidean distance to the surface of the shape (example below)
Extracting explicit shape outputs (e.g., a mesh) requires post-processing such as marching cubes
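A tiny analytic example of an SDF (a sphere at the origin); in a learned implicit surface, a neural network plays the role of this function:

```python
import torch

def sphere_sdf(x, radius=1.0):
    # signed distance to a sphere at the origin:
    # negative inside, zero on the surface (the level set), positive outside
    return x.norm(dim=-1) - radius

pts = torch.tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
print(sphere_sdf(pts))   # tensor([-1., 1.]): first point inside, second outside
```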
NeRF for view synthesis
View synthesis
- Input: many images of the same scene (with known camera parameters)
- Output: images from novel viewpoints
Volume rendering
Abstract away light sources and objects. For each point in space, we need to know:
- How much light does it emit?
- How opaque?
Each ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:
- Volume density: $\sigma(\mathbf{r}(t))$
- Color in direction $\mathbf{d}$: $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$
Volume rendering equation (color observed by the camera):
- $C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt$, with $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$ (discretized sketch below)
- $t_n$: near point
- $t$: current point
- $t_f$: far point
- $T(t)$: transmittance: how much light from the current point will reach the camera?
- $\sigma(\mathbf{r}(t))$: opacity: how opaque is the current point?
- $\mathbf{c}(\mathbf{r}(t), \mathbf{d})$: what color does the current point emit along the direction towards the camera?
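A minimal sketch of the standard discretized form of this integral (alpha compositing over samples along one ray); treating the final gap as effectively infinite is the usual convention:

```python
import torch

def render_ray(sigma, rgb, t):
    # sigma: (S,) densities, rgb: (S, 3) colors, t: (S,) sample depths along one ray
    delta = t[1:] - t[:-1]                                # gaps between samples
    delta = torch.cat([delta, delta.new_tensor([1e10])])  # last gap ~ infinite
    alpha = 1.0 - torch.exp(-sigma * delta)               # opacity of each segment
    # transmittance T_i: probability that light survives all earlier segments
    T = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = T * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)       # expected color, shape (3,)
```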
NeRF (neural radiance fields)
- Input: position $\mathbf{x}$ and viewing direction $\mathbf{d}$
- Output: color $\mathbf{c}(\mathbf{x}, \mathbf{d})$ and volume density $\sigma(\mathbf{x})$
Architecture
- Fully connected network (sketch below)
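A heavily simplified sketch of this MLP; the real NeRF positionally encodes $\mathbf{x}$ and $\mathbf{d}$ and uses a deeper network with a skip connection, and the sizes here are illustrative:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)    # density depends on position only
        self.rgb = nn.Sequential(nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
                                 nn.Linear(hidden // 2, 3))  # color also sees direction

    def forward(self, x, d):                 # x, d: (N, 3) positions and view directions
        h = self.trunk(x)
        sigma = torch.relu(self.sigma(h)).squeeze(-1)             # non-negative density
        rgb = torch.sigmoid(self.rgb(torch.cat([h, d], dim=-1)))  # colors in [0, 1]
        return sigma, rgb
```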
Very strong results, but very slow: rendering one image requires querying the MLP at many sample points along every camera ray