ImageNet training: extremely low GPU utilization
As pointed out in https://github.com/pytorch/examples/issues/164, the ImageNet training script gets almost zero GPU utilization. I'm running `python main.py -a resnet18 /home/wangtao/imagenet/ILSVRC/Data/CLS-LOC`:
```
Epoch: [0][1760/5005] Time 1.599 (2.648) Data 1.296 (2.388) Loss 5.6502 (6.3792) Prec@1 6.250 (1.374) Prec@5 14.844 (4.918)
Epoch: [0][1770/5005] Time 0.238 (2.654) Data 0.001 (2.393) Loss 5.5388 (6.3752) Prec@1 3.906 (1.388) Prec@5 13.281 (4.957)
Epoch: [0][1780/5005] Time 0.222 (2.652) Data 0.001 (2.391) Loss 5.6422 (6.3714) Prec@1 4.688 (1.402) Prec@5 12.109 (4.997)
Epoch: [0][1790/5005] Time 2.700 (2.649) Data 2.383 (2.389) Loss 5.7257 (6.3679) Prec@1 4.297 (1.417) Prec@5 12.109 (5.038)
Epoch: [0][1800/5005] Time 1.066 (2.648) Data 0.849 (2.388) Loss 5.6143 (6.3641) Prec@1 3.516 (1.430) Prec@5 12.891 (5.078)
Epoch: [0][1810/5005] Time 0.297 (2.654) Data 0.001 (2.393) Loss 5.7683 (6.3606) Prec@1 2.734 (1.443) Prec@5 11.719 (5.119)
Epoch: [0][1820/5005] Time 0.218 (2.652) Data 0.001 (2.392) Loss 5.8934 (6.3568) Prec@1 2.344 (1.454) Prec@5 7.812 (5.158)
Epoch: [0][1830/5005] Time 2.469 (2.650) Data 2.126 (2.389) Loss 5.5614 (6.3530) Prec@1 4.297 (1.469) Prec@5 16.797 (5.204)
Epoch: [0][1840/5005] Time 0.326 (2.648) Data 0.098 (2.388) Loss 5.8356 (6.3492) Prec@1 2.734 (1.486) Prec@5 10.938 (5.248)
Epoch: [0][1850/5005] Time 0.235 (2.652) Data 0.001 (2.392) Loss 5.5058 (6.3454) Prec@1 6.250 (1.500) Prec@5 14.453 (5.284)
Epoch: [0][1860/5005] Time 0.224 (2.648) Data 0.001 (2.388) Loss 5.6114 (6.3415) Prec@1 3.906 (1.517) Prec@5 13.281 (5.331)
Epoch: [0][1870/5005] Time 3.704 (2.646) Data 3.464 (2.387) Loss 5.6540 (6.3380) Prec@1 3.125 (1.528) Prec@5 11.719 (5.370)
/home/wangtao/anaconda2/envs/tensorflow_/lib/python2.7/site-packages/PIL/TiffImagePlugin.py:764: UserWarning: Corrupt EXIF data. Expecting to read 4 bytes but only got 0.
warnings.warn(str(msg))
Epoch: [0][1880/5005] Time 0.279 (2.642) Data 0.001 (2.382) Loss 5.4274 (6.3344) Prec@1 3.906 (1.540) Prec@5 14.453 (5.410)
Epoch: [0][1890/5005] Time 0.251 (2.646) Data 0.002 (2.386) Loss 5.6548 (6.3304) Prec@1 4.688 (1.559) Prec@5 12.109 (5.457)
Epoch: [0][1900/5005] Time 0.232 (2.643) Data 0.001 (2.384) Loss 5.6261 (6.3268) Prec@1 8.984 (1.577) Prec@5 15.234 (5.500)
Epoch: [0][1910/5005] Time 6.258 (2.642) Data 6.032 (2.382) Loss 5.7049 (6.3234) Prec@1 3.516 (1.593) Prec@5 12.500 (5.539)
Epoch: [0][1920/5005] Time 0.238 (2.638) Data 0.001 (2.378) Loss 5.5728 (6.3198) Prec@1 1.172 (1.604) Prec@5 11.328 (5.576)
Epoch: [0][1930/5005] Time 0.320 (2.642) Data 0.001 (2.383) Loss 5.4732 (6.3161) Prec@1 8.984 (1.618) Prec@5 17.578 (5.615)
Epoch: [0][1940/5005] Time 0.220 (2.640) Data 0.001 (2.380) Loss 5.5701 (6.3121) Prec@1 4.688 (1.635) Prec@5 16.797 (5.661)
Epoch: [0][1950/5005] Time 0.285 (2.635) Data 0.001 (2.376) Loss 5.5285 (6.3086) Prec@1 6.250 (1.650) Prec@5 14.453 (5.698)
Epoch: [0][1960/5005] Time 0.221 (2.633) Data 0.001 (2.374) Loss 5.3744 (6.3045) Prec@1 6.250 (1.667) Prec@5 18.750 (5.740)
Epoch: [0][1970/5005] Time 0.283 (2.638) Data 0.001 (2.379) Loss 5.6604 (6.3011) Prec@1 4.297 (1.680) Prec@5 13.281 (5.781)
Epoch: [0][1980/5005] Time 0.220 (2.636) Data 0.001 (2.377) Loss 5.5954 (6.2976) Prec@1 3.906 (1.693) Prec@5 15.234 (5.820)
Epoch: [0][1990/5005] Time 0.226 (2.632) Data 0.001 (2.373) Loss 5.6544 (6.2938) Prec@1 4.297 (1.709) Prec@5 12.500 (5.862)
Epoch: [0][2000/5005] Time 0.659 (2.629) Data 0.375 (2.371) Loss 5.5378 (6.2900) Prec@1 3.906 (1.729) Prec@5 16.406 (5.909)
Epoch: [0][2010/5005] Time 0.245 (2.634) Data 0.001 (2.376) Loss 5.5171 (6.2864) Prec@1 3.906 (1.745) Prec@5 13.281 (5.950)
Epoch: [0][2020/5005] Time 0.230 (2.630) Data 0.001 (2.372) Loss 5.4883 (6.2826) Prec@1 5.469 (1.761) Prec@5 15.234 (5.998)
Epoch: [0][2030/5005] Time 0.225 (2.628) Data 0.001 (2.370) Loss 5.5814 (6.2790) Prec@1 2.734 (1.777) Prec@5 12.500 (6.044)
Epoch: [0][2040/5005] Time 0.234 (2.626) Data 0.001 (2.367) Loss 5.4643 (6.2754) Prec@1 7.422 (1.792) Prec@5 15.625 (6.084)
Epoch: [0][2050/5005] Time 0.296 (2.632) Data 0.001 (2.374) Loss 5.5963 (6.2717) Prec@1 6.250 (1.807) Prec@5 14.844 (6.127)
Epoch: [0][2060/5005] Time 0.231 (2.628) Data 0.001 (2.370) Loss 5.6223 (6.2683) Prec@1 3.906 (1.822) Prec@5 11.719 (6.165)
Epoch: [0][2070/5005] Time 0.293 (2.626) Data 0.001 (2.368) Loss 5.6465 (6.2651) Prec@1 4.688 (1.832) Prec@5 12.109 (6.204)
Epoch: [0][2080/5005] Time 0.260 (2.622) Data 0.002 (2.364) Loss 5.5126 (6.2617) Prec@1 5.469 (1.848) Prec@5 14.453 (6.243)
Epoch: [0][2090/5005] Time 0.272 (2.629) Data 0.002 (2.370) Loss 5.5466 (6.2584) Prec@1 5.078 (1.863) Prec@5 13.672 (6.283)
Epoch: [0][2100/5005] Time 0.218 (2.626) Data 0.001 (2.368) Loss 5.4685 (6.2547) Prec@1 3.516 (1.881) Prec@5 16.797 (6.328)
Epoch: [0][2110/5005] Time 0.251 (2.624) Data 0.001 (2.366) Loss 5.4764 (6.2512) Prec@1 4.297 (1.893) Prec@5 16.016 (6.368)
Epoch: [0][2120/5005] Time 0.296 (2.620) Data 0.001 (2.362) Loss 5.7063 (6.2478) Prec@1 2.344 (1.911) Prec@5 14.062 (6.413)
Epoch: [0][2130/5005] Time 0.261 (2.625) Data 0.001 (2.367) Loss 5.5580 (6.2445) Prec@1 6.250 (1.924) Prec@5 13.672 (6.451)
Epoch: [0][2140/5005] Time 0.263 (2.623) Data 0.002 (2.365) Loss 5.5810 (6.2409) Prec@1 3.516 (1.941) Prec@5 12.891 (6.494)
Epoch: [0][2150/5005] Time 0.236 (2.619) Data 0.001 (2.361) Loss 5.6755 (6.2378) Prec@1 5.078 (1.955) Prec@5 14.062 (6.532)
Epoch: [0][2160/5005] Time 0.221 (2.617) Data 0.001 (2.359) Loss 5.7032 (6.2345) Prec@1 3.906 (1.970) Prec@5 10.547 (6.568)
Epoch: [0][2170/5005] Time 0.286 (2.622) Data 0.002 (2.364) Loss 5.5394 (6.2312) Prec@1 3.516 (1.985) Prec@5 13.672 (6.611)
Epoch: [0][2180/5005] Time 0.245 (2.618) Data 0.002 (2.360) Loss 5.4341 (6.2276) Prec@1 9.766 (2.000) Prec@5 18.359 (6.653)
```
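For readers parsing the log: each column prints the value for the current batch with the running average in parentheses. The example's main.py produces these with an AverageMeter-style helper, roughly like this sketch (field names follow the original script; the example comment values are taken from the log above):

```python
class AverageMeter(object):
    """Tracks the current value and the running average of a metric."""

    def __init__(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

# Each column is "meter.val (meter.avg)". For example,
# "Data 0.001 (2.392)" means this batch waited only 1 ms for data,
# but the average wait so far is 2.392 s per batch.
```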
Your dataloaders are taking up the bulk of the running time: the averaged Data time (~2.36 s) accounts for almost all of the averaged batch Time (~2.62 s), so the GPU spends most of each iteration waiting for input. Most likely you are not assigning enough CPUs to the loader workers. Try at least 32 CPUs for 8 GPUs and 8 workers.
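If it helps, here is a minimal sketch of the fix in plain PyTorch. The transform pipeline, batch size, and num_workers=8 below are illustrative assumptions, not values from the original post; the key change is simply raising num_workers (main.py exposes the same knob through its -j/--workers flag, which defaults to a low value):

```python
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Illustrative path and hyperparameters -- adjust to your setup.
traindir = '/home/wangtao/imagenet/ILSVRC/Data/CLS-LOC/train'

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))

train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,    # one worker process per CPU core you can spare;
                      # too few workers starves the GPU, as in the log above
    pin_memory=True)  # page-locked buffers speed up host-to-GPU copies
```

With enough workers, the per-batch Data time should stay near zero instead of spiking every few iterations as it does in the log above.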
I have the same problem: the dataloader time is unstable. How did you finally solve it?