Better device handling
I can't reproduce the issue of every process allocating memory on GPU 0 (https://github.com/pytorch/examples/issues/969), so the underlying problem may have been fixed. Regardless, `torch.cuda.set_device` is now discouraged in favor of simply setting `CUDA_VISIBLE_DEVICES`. Furthermore, `validate()` and `AverageMeter` previously tried to determine which device to use on their own; we should instead pass in the agreed-upon device.
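For illustration, here is a minimal sketch of the "pass the device in" pattern described above; the `validate()` and `AverageMeter` below are simplified stand-ins for the ones in the ImageNet example, not the actual diff:

```python
# Minimal sketch of the "pass the device in" pattern (simplified, not the actual diff).
import torch


class AverageMeter:
    """Tracks a running average; device-agnostic because it only stores Python floats."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / max(self.count, 1)


def validate(loader, model, criterion, device):
    """Evaluation loop that uses the device chosen by the caller instead of guessing."""
    losses = AverageMeter()
    model.eval()
    with torch.no_grad():
        for images, target in loader:
            images = images.to(device, non_blocking=True)
            target = target.to(device, non_blocking=True)
            loss = criterion(model(images), target)
            losses.update(loss.item(), images.size(0))
    return losses.avg


# The caller picks the device once and threads it through.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```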
Tested on an 8-GPU instance. I made sure that both commands below still work:
```bash
python main.py -a resnet50 --gpu 1 --evaluate --batch-size 1024 /data/ImageNet/
python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:60000' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 1024 /data/ImageNet/
```
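For the multiprocessing-distributed run, the idea is that each spawned worker restricts itself to one physical GPU by setting `CUDA_VISIBLE_DEVICES` before any CUDA initialization, so `cuda:0` inside each process maps to a different card. A minimal sketch of that wiring (hypothetical, not the exact code in this PR):

```python
# Hypothetical wiring: confine each spawned worker to one GPU via CUDA_VISIBLE_DEVICES.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size, dist_url):
    # Must happen before this process touches CUDA; afterwards "cuda:0" is this worker's GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    device = torch.device("cuda:0")
    dist.init_process_group("nccl", init_method=dist_url, world_size=world_size, rank=rank)
    # ... build the model, wrap it in DistributedDataParallel, and pass `device` into
    # train()/validate() instead of letting them guess ...
    dist.destroy_process_group()


if __name__ == "__main__":
    ngpus = torch.cuda.device_count()
    mp.spawn(worker, args=(ngpus, "tcp://127.0.0.1:60000"), nprocs=ngpus)
```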
After detaching the screen, `nvidia-smi` shows that the second command results in the expected, distributed GPU usage:
```
$ nvidia-smi
Sun Dec  1 01:01:17 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   56C    P0             94W /  400W |   14949MiB /  40960MiB |     62%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   49C    P0             99W /  400W |   15093MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   52C    P0            105W /  400W |   15097MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   55C    P0            103W /  400W |   15095MiB /  40960MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   54C    P0            102W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   50C    P0            103W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   47C    P0             97W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   49C    P0             85W /  400W |   14953MiB /  40960MiB |     73%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2800326      C   /usr/bin/python                             14940MiB |
|    1   N/A  N/A   2800327      C   /usr/bin/python                             15084MiB |
|    2   N/A  N/A   2800328      C   /usr/bin/python                             15088MiB |
|    3   N/A  N/A   2800329      C   /usr/bin/python                             15084MiB |
|    4   N/A  N/A   2800330      C   /usr/bin/python                             15084MiB |
|    5   N/A  N/A   2800331      C   /usr/bin/python                             15084MiB |
|    6   N/A  N/A   2800332      C   /usr/bin/python                             15084MiB |
|    7   N/A  N/A   2800333      C   /usr/bin/python                             14944MiB |
+-----------------------------------------------------------------------------------------+
```