geometry_processing icon indicating copy to clipboard operation
geometry_processing copied to clipboard

3D object recognition using multiview CNNs

Setup

Clone the repo, add the package to the python path, download python dependencies.

PYTHONPATH=$(pwd):$PYTHONPATH
git clone https://github.com/bradyz/geometry_processing.git

pip install -r requirements.txt
or
pip install --user -r requirements.txt

Data Dependencies

Modelnet with 25 viewpoints each - https://drive.google.com/open?id=0B0d9M5p2RxBqN0IzOXpudjMyTDQ

Our model's weights - https://drive.google.com/open?id=0B0d9M5p2RxBqMlNZOFg1YmlYR3c

Contents

  1. View Generator - take 2D projections of mesh files.
  2. Train CNN - fine tune a VGG-16 CNN on the new images.
  3. Classifier - train a SVM on the CNN features.
  4. References - papers and resources used.

View Generator

Given a model and a list of viewpoints - .png image files that correspond to 2D projection will be generated.

Preprocessing consists of centering the mesh, uniformly scaling the bounding box to a unit cube, and taking viewpoints that are centered at the centroid.

Currently there are 25 viewpoints being generated that fall around the unit sphere from 5 different phis and 5 different thetas (spherical coordinates).

Train CNN

The model used in this project is a VGG-16 with pretrained weights (ImageNet), with two additional layers fc1 (2048), fc2 (1024).

Training was done for 10 epochs on 100k training images (4000 meshes) over 10 labels of ModelNet10. The images were 224 x 224 rgb. Cross entropy loss was used in combination with a SGD optimizer with a batch size of 64. Training took approximately 5 hours a NVIDIA K40 gpu.

After training, classification accuracy, given a single pose, is at 80% on a test set of 20k images.

Classifier

The question asked is - given a mesh and several viewpoints, does it help to use all of the viewpoints (MVCNN), or does a selected subset of size k give better accuracy?

We use a one-vs-rest linear SVM, similar to MVCNN, to classify activation values of the final fc layer.

The current methods consist of the using the following (currently unimplemented) -

  • Sort by minimized entropy
  • Random K
  • FPS (farthest point selection) on sorted

References

Multi-view Convolutional Neural Networks for 3D Shape Recognition - https://arxiv.org/pdf/1505.00880.pdf Princeton ModelNet - http://modelnet.cs.princeton.edu/