
How to reproduce the result on DDAD

Open csBob123 opened this issue 4 years ago • 7 comments

Hi, thank you for releasing the code. I am trying to train PackNet on DDAD, but I cannot reproduce the reported results so far. I am using 8 V100 GPUs. The training command is:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 horovodrun -np 8 -H localhost:8 python scripts/train.py ./configs/train_ddad.yaml
```

The details of my config are as follows:

```yaml
model:
    name: 'SelfSupModel'
    optimizer:
        name: 'Adam'
        depth:
            lr: 0.00009
        pose:
            lr: 0.00009
    scheduler:
        name: 'StepLR'
        step_size: 30
        gamma: 0.5
    depth_net:
        name: 'PackNet01'
        version: '1A'
    pose_net:
        name: 'PoseNet'
        version: ''
    params:
        crop: ''
        min_depth: 0.0
        max_depth: 200.0
datasets:
    augmentation:
        image_shape: (384, 640)
    train:
        batch_size: 8
        num_workers: 8
        dataset: ['DGP']
        path: ['/data/ddad_train_val/ddad.json']
        split: ['train']
        depth_type: ['lidar']
        cameras: [['camera_01']]
        repeat: [5]
    validation:
        num_workers: 8
        dataset: ['DGP']
        path: ['/data/ddad_train_val/ddad.json']
        split: ['val']
        depth_type: ['lidar']
        cameras: [['camera_01']]
    test:
        num_workers: 8
        dataset: ['DGP']
        path: ['/data/ddad_train_val/ddad.json']
        split: ['val']
        depth_type: ['lidar']
        cameras: [['camera_01']]
checkpoint:
    filepath: './data/experiments'
    monitor: 'abs_rel_pp_gt'
    monitor_index: 0
    mode: 'min'
```
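For context on the numbers in this config: under Horovod data-parallel training, the effective batch size is the per-GPU batch size times the number of workers, and a common (though not universal) heuristic scales the base learning rate linearly with the worker count. A minimal sketch (the helper name is illustrative, not part of packnet-sfm):

```python
# Hypothetical helper: effective batch size and linearly scaled LR
# under data-parallel training. Not packnet-sfm code.
def effective_batch_and_lr(per_gpu_batch, num_gpus, base_lr):
    """Return (effective batch size, linearly scaled learning rate)."""
    return per_gpu_batch * num_gpus, base_lr * num_gpus

# The config above: batch_size 8 on each of 8 V100s, base LR 9e-5.
batch, lr = effective_batch_and_lr(8, 8, 9e-5)
# batch == 64
```

Whether packnet-sfm applies this linear scaling internally is worth checking in the training script; the log below reports the LR as 4.50e-05 after the scheduler has run.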

```
E: 50 BS: 8 - SelfSupModel LR (Adam): Depth 4.50e-05 Pose 4.50e-05

*** /data/ddad_train_val/ddad.json/val (camera_01)

| METRIC      | abs_rel | sqr_rel | rmse   | rmse_log | a1    | a2    | a3    |
| DEPTH       | 0.853   | 23.485  | 37.371 | 2.022    | 0.002 | 0.005 | 0.008 |
| DEPTH_PP    | 0.853   | 23.542  | 37.468 | 2.025    | 0.002 | 0.004 | 0.008 |
| DEPTH_GT    | 0.268   | 12.451  | 19.267 | 0.333    | 0.705 | 0.869 | 0.936 |
| DEPTH_PP_GT | 0.257   | 11.199  | 18.532 | 0.324    | 0.709 | 0.873 | 0.939 |
```
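For reference, the abs_rel column above is the mean absolute relative depth error over valid ground-truth pixels. A minimal NumPy sketch of the standard definition (not the repo's evaluation code):

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative depth error over valid (gt > 0) pixels."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Toy example: a prediction that is 10% off everywhere.
gt = np.array([10.0, 20.0, 50.0])
pred = gt * 1.1
err = abs_rel(pred, gt)  # ≈ 0.1
```

An abs_rel of 0.853 (versus the ~0.17 reported for DDAD) therefore indicates the self-supervised training has not converged to a useful depth estimate, rather than a small accuracy gap.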

Are there any problems? Thank you for your attention.

csBob123 avatar May 21 '21 10:05 csBob123

Hmm, can you try a few things:

  • Start from a pre-trained model (e.g. a KITTI model) to see if it diverges
  • Try another network (DepthResNet or PoseResNet)
  • Play around with the learning rate

By the way, once you get some numbers you can try submitting to our EvalAI DDAD challenge! https://eval.ai/web/challenges/challenge-page/902/overview
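Swapping in the ResNet-based networks is a small config change. A sketch against the config above; the '18pt' version string (an ImageNet-pretrained ResNet18) follows the pattern used in other configs in the repo, so adjust it if your version differs:

```yaml
# Sketch: replace PackNet01/PoseNet with the ResNet variants suggested above.
model:
    depth_net:
        name: 'DepthResNet'
        version: '18pt'
    pose_net:
        name: 'PoseResNet'
        version: '18pt'
```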

VitorGuizilini-TRI avatar May 21 '21 15:05 VitorGuizilini-TRI


Did you use any pre-trained weights to get the results of 0.173 (abs_rel) on DDAD and 0.111 (abs_rel) on KITTI, or did you just train from scratch?

csBob123 avatar May 21 '21 16:05 csBob123

No, those are trained from scratch with PackNet. I just mentioned pre-trained weights as a way to see if there is anything wrong with the training setup that you are using.

VitorGuizilini-TRI avatar May 26 '21 15:05 VitorGuizilini-TRI

Hi, thanks for your work. Were the results on DDAD produced by training from scratch using the config setup provided here? https://github.com/TRI-ML/packnet-sfm/blob/master/configs/train_ddad.yaml

a1600012888 avatar Jun 01 '21 11:06 a1600012888

@a1600012888 Yes, that configuration file should work.

VitorGuizilini-TRI avatar Jun 01 '21 16:06 VitorGuizilini-TRI


Thanks!

a1600012888 avatar Jun 02 '21 16:06 a1600012888


Hi, for the DDAD experiments, did you train the model using 8 GPU cards with this config file? If so, does that mean the effective batch size is 8*2 = 16, and the learning rate is 9e-5?

a1600012888 avatar Jun 03 '21 14:06 a1600012888