
Training loss in tutorial/dibr_tutorial.ipynb doesn't decrease on Pascal GPU

Open lazykyama opened this issue 3 years ago • 2 comments

I found an issue with the model training loss in the DIBR tutorial. Could you look into the details below?

Problem

In contrast to the example log in the tutorial notebook, model training on a Pascal GPU (in my case, a TITAN X (Pascal)) does not work well. As shown below, the training loss on V100 and A100 decreases as expected, but the loss on the TITAN X does not.

Titan X (Pascal):

Epoch 0 - loss: 0.995875358581543
Epoch 1 - loss: 0.9929110407829285
Epoch 2 - loss: 0.9927653074264526
Epoch 3 - loss: 0.9926865100860596
Epoch 4 - loss: 0.9877458810806274
Epoch 5 - loss: 0.9935289025306702
Epoch 6 - loss: 0.993104875087738
Epoch 7 - loss: 0.9930578470230103
Epoch 8 - loss: 0.9965534806251526
Epoch 9 - loss: 0.9983258843421936
Epoch 10 - loss: 0.9973642826080322
Epoch 11 - loss: 0.9987236261367798
Epoch 12 - loss: 0.9923691153526306
Epoch 13 - loss: 0.9902397394180298
Epoch 14 - loss: 0.9947593808174133
Epoch 15 - loss: 0.9932674765586853
Epoch 16 - loss: 0.9897530674934387
Epoch 17 - loss: 0.988956868648529
Epoch 18 - loss: 0.9974667429924011
Epoch 19 - loss: 0.9950008988380432
...

A100:

Epoch 0 - loss: 0.2732722759246826
Epoch 1 - loss: 0.24036036431789398
Epoch 2 - loss: 0.3558407723903656
Epoch 3 - loss: 0.216680645942688
Epoch 4 - loss: 0.12641094624996185
Epoch 5 - loss: 0.10339867323637009
Epoch 6 - loss: 0.09548076242208481
Epoch 7 - loss: 0.07003585249185562
Epoch 8 - loss: 0.06331620365381241
Epoch 9 - loss: 0.059887781739234924
Epoch 10 - loss: 0.046345192939043045
Epoch 11 - loss: 0.0587751530110836
Epoch 12 - loss: 0.16587351262569427
Epoch 13 - loss: 0.06599003076553345
Epoch 14 - loss: 0.04489019885659218
Epoch 15 - loss: 0.0848640576004982
Epoch 16 - loss: 0.04892460256814957
Epoch 17 - loss: 0.04731146618723869
Epoch 18 - loss: 0.06393366307020187
Epoch 19 - loss: 0.03333395719528198
...

V100:

Epoch 0 - loss: 0.2664042115211487
Epoch 1 - loss: 0.22355087101459503
Epoch 2 - loss: 0.24613362550735474
Epoch 3 - loss: 0.2849520742893219
Epoch 4 - loss: 0.2501652240753174
Epoch 5 - loss: 0.08334556221961975
Epoch 6 - loss: 0.1380174607038498
Epoch 7 - loss: 0.15450777113437653
Epoch 8 - loss: 0.0781756043434143
Epoch 9 - loss: 0.15418516099452972
Epoch 10 - loss: 0.2644664943218231
Epoch 11 - loss: 0.11432844400405884
Epoch 12 - loss: 0.06503141671419144
Epoch 13 - loss: 0.06677331030368805
Epoch 14 - loss: 0.04121328890323639
Epoch 15 - loss: 0.03857976570725441
Epoch 16 - loss: 0.03913675621151924
Epoch 17 - loss: 0.05681620538234711
Epoch 18 - loss: 0.0862089991569519
Epoch 19 - loss: 0.07951713353395462
...

Environment

  • OS
    • Ubuntu 18.04 (Kernel: 4.15.0-177-generic)
  • Docker container
    • NGC's PyTorch container (nvcr.io/nvidia/pytorch:22.04-py3)
      • CUDA: 11.6.2
      • PyTorch: 1.12.0a0+bd13bc6
      • More details: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04
  • GPU
    • TITAN X (Pascal)
  • Driver
    • 515.43.04

Note that I ran the tutorial code for the V100 and A100 on a DGX-1 and a DGX A100, respectively, using the same Docker container as the runtime environment.

Regarding the Kaolin version, I checked out commit 2389781f8605ebce7da4096d6478c8c4c2e5d6f1 and built and installed Kaolin from it as follows.

git clone --recursive https://github.com/NVIDIAGameWorks/kaolin
cd kaolin
git checkout 2389781f8605ebce7da4096d6478c8c4c2e5d6f1
export IGNORE_TORCH_VER=1
pip uninstall -y Cython && pip install Cython==0.29.20 && python setup.py develop
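Since the divergence appears to depend on the GPU architecture, one thing worth checking (a diagnostic sketch I'm suggesting, not something from the tutorial) is whether the local PyTorch build and the compiled extensions actually target the GPU's compute capability. A Pascal card is sm_61, and a binary compiled only for sm_70+ can misbehave in custom CUDA ops:

```python
def report_cuda_arch():
    """Print the visible GPU's compute capability and the CUDA
    architectures the installed PyTorch build was compiled for.
    A mismatch here is one plausible cause of arch-specific bugs."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    return "\n".join([
        f"device: {torch.cuda.get_device_name(0)}",
        f"compute capability: sm_{major}{minor}",
        f"torch {torch.__version__}, CUDA {torch.version.cuda}",
        f"torch built for: {', '.join(torch.cuda.get_arch_list())}",
    ])

if __name__ == "__main__":
    print(report_cuda_arch())
```

On a TITAN X (Pascal) this should report sm_61; if sm_61 (or a compatible lower arch) is missing from the build's arch list, setting TORCH_CUDA_ARCH_LIST to include it before rebuilding Kaolin may be worth a try.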

lazykyama avatar Jun 06 '22 00:06 lazykyama

Same for me. It doesn't work on an NVIDIA GeForce RTX 3080.

Epoch 0 - loss: 0.9929165840148926
Epoch 1 - loss: 1.0004401206970215
Epoch 2 - loss: 0.9890572428703308
Epoch 3 - loss: 0.992308497428894
Epoch 4 - loss: 0.9881731867790222
Epoch 5 - loss: 0.9978615641593933
Epoch 6 - loss: 0.9916446208953857
Epoch 7 - loss: 0.9934905767440796
Epoch 8 - loss: 0.9890478253364563
Epoch 9 - loss: 0.9899123311042786
Epoch 10 - loss: 0.9867953658103943
Epoch 11 - loss: 0.99429851770401
Epoch 12 - loss: 0.9958875179290771
Epoch 13 - loss: 0.9942327737808228
Epoch 14 - loss: 0.9900299906730652
Epoch 15 - loss: 0.993216872215271
Epoch 16 - loss: 0.993524968624115
...

YufengJin avatar Aug 22 '22 09:08 YufengJin

@YufengJin I'd be interested in more details. Can you share your system info with me? (PyTorch version / Python version / OS / CUDA version)

Caenorst avatar Aug 29 '22 18:08 Caenorst