
Imagenet training time?

Open tonystark940501 opened this issue 6 years ago • 2 comments

Hi @quark0, I ran train_imagenet.py, but it seems very slow: the first epoch took nearly 12 hours on a Titan Xp. Here is part of the log:

```
01/02 02:15:44 PM param size = 4.718752MB
01/02 02:18:25 PM epoch 0 lr 1.000000e-01
01/02 02:18:44 PM train 000 9.666612e+00 0.000000 0.000000
...
01/03 01:41:33 AM train 9900 7.790667e+00 9.245673 22.546018
01/03 01:46:48 AM train 10000 7.779901e+00 9.340941 22.717884
01/03 01:47:13 AM train_acc 9.347728
01/03 01:47:25 AM valid 000 2.627861e+00 46.875000 75.000000
01/03 01:51:20 AM valid 100 3.575097e+00 24.234220 50.092822
01/03 01:55:15 AM valid 200 3.782414e+00 21.968284 46.618470
01/03 01:59:07 AM valid 300 3.929313e+00 21.072986 44.463767
01/03 02:02:35 AM valid_acc_top1 20.944000
01/03 02:02:35 AM valid_acc_top5 43.834000
01/03 02:02:36 AM epoch 1 lr 9.700000e-02
```

As you can see, the first epoch ran from 01/02 02:18:25 PM to 01/03 02:02:36 AM, nearly 12 hours. Can I get some help from you?
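For anyone timing their own run, the epoch duration can be computed directly from the log timestamps. This is a stdlib-only sketch; the format string matches the `MM/DD HH:MM:SS AM/PM` stamps in the log above, with the year assumed to be 2019:

```python
from datetime import datetime

# Timestamps copied from the log above (the year is not logged; assume 2019).
FMT = "%Y %m/%d %I:%M:%S %p"
start = datetime.strptime("2019 01/02 02:18:25 PM", FMT)  # epoch 0 starts
end = datetime.strptime("2019 01/03 02:02:36 AM", FMT)    # epoch 1 starts

elapsed = end - start
print(f"first epoch took {elapsed}")  # 11:44:11, i.e. nearly 12 hours
```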

tonystark940501 avatar Jan 03 '19 03:01 tonystark940501

Hi @tonystark940501, I train ImageNet on 2 K80 GPUs and it is not slow; each epoch takes ~3 hours:

```
2019-01-22 13:37:49,194 epoch 0 lr 1.000000e-01
2019-01-22 13:38:07,415 train 000 9.681139e+00 0.000000 0.781250
...
2019-01-22 16:16:18,836 train 5000 8.394089e+00 4.498241 13.014194
2019-01-22 16:16:31,526 train_acc 4.503082
2019-01-22 16:16:37,187 valid 000 3.657096e+00 31.640625 62.109375
2019-01-22 16:18:15,691 valid 100 4.629506e+00 11.637531 29.470916
2019-01-22 16:19:46,688 valid_acc_top1 11.414000
2019-01-22 16:19:46,728 valid_acc_top5 28.660000
2019-01-22 16:19:46,959 epoch 1 lr 9.700000e-02
2019-01-22 16:19:55,304 train 000 7.320449e+00 12.890625 30.468750
...
2019-01-22 18:58:28,874 train 5000 6.641470e+00 18.039595 38.859025
2019-01-22 18:58:35,694 train_acc 18.043862
2019-01-22 18:58:41,219 valid 000 2.488467e+00 47.265625 76.562500
2019-01-22 19:00:19,756 valid 100 3.537449e+00 24.798886 51.295637
2019-01-22 19:01:54,287 valid_acc_top1 24.274000
2019-01-22 19:01:54,287 valid_acc_top5 48.996000
2019-01-22 19:01:54,584 epoch 2 lr 9.409000e-02
```

Which Python, PyTorch, CUDA, and cuDNN versions do you have installed?
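A quick way to answer those questions is a short script. The `torch` attributes below are standard PyTorch APIs for reporting the CUDA and cuDNN builds; the import is guarded so the snippet still runs where PyTorch is absent:

```python
import platform

# Report the versions the comment above asks about.
print("Python:", platform.python_version())
try:
    import torch
    print("PyTorch:", torch.__version__)
    print("CUDA:", torch.version.cuda)               # None for CPU-only builds
    print("cuDNN:", torch.backends.cudnn.version())  # None if cuDNN unavailable
except ImportError:
    print("PyTorch not installed")
```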

chakkritte avatar Jan 24 '19 06:01 chakkritte

3 hours per epoch means you would need about a month of training in total, which is much longer than the 12 days (on a single GPU) mentioned in the paper. What do you think?
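As a rough check of that estimate (assuming 250 training epochs, which I believe is the default in train_imagenet.py):

```python
# Back-of-the-envelope: total wall-clock time at 3 hours per epoch.
EPOCHS = 250          # assumed default in train_imagenet.py
HOURS_PER_EPOCH = 3

total_hours = EPOCHS * HOURS_PER_EPOCH
print(total_hours / 24, "days")  # 31.25 days, i.e. about a month
```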

thinkInJava33 avatar Mar 16 '20 03:03 thinkInJava33