
Better device handling

Open EIFY opened this issue 11 months ago • 2 comments

I can't reproduce the issue of every process allocating memory on GPU 0 (https://github.com/pytorch/examples/issues/969), so the underlying issue may have been fixed. Regardless, use of torch.cuda.set_device is now discouraged in favor of simply setting CUDA_VISIBLE_DEVICES. Furthermore, validate() and AverageMeter previously tried to determine the device on their own; we should instead pass in the agreed-upon device explicitly.
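A minimal sketch of the masking approach, using a hypothetical helper name (not from the actual script): in each spawned worker, CUDA_VISIBLE_DEVICES would have to be set before anything initializes CUDA in that process.

```python
import os

def pin_worker_to_gpu(rank: int) -> str:
    """Restrict this process to one physical GPU via CUDA_VISIBLE_DEVICES.

    Must be called before CUDA is initialized in this process. Afterward,
    the pinned GPU is the only one visible, so it is always addressed as
    "cuda:0" regardless of its physical index.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
    return "cuda:0"
```

Each worker would call this with its local rank and then pass the returned device string down to validate() and AverageMeter, rather than letting them guess.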

Tested on an 8-GPU instance. I made sure that both commands below still work:

python main.py -a resnet50 --gpu 1 --evaluate --batch-size 1024 /data/ImageNet/

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:60000' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 --batch-size 1024 /data/ImageNet/
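For illustration, here is a stripped-down AverageMeter that takes the agreed-upon device as a constructor argument instead of inferring one. This is a simplified sketch, not the actual class from pytorch/examples (the real one also uses the device for tensor work such as a distributed all_reduce):

```python
class AverageMeter:
    """Tracks the current value and running average of a metric."""

    def __init__(self, name: str, device: str = "cpu"):
        self.name = name
        self.device = device  # passed in explicitly by the caller, never guessed
        self.reset()

    def reset(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val: float, n: int = 1):
        # n is the batch size the value was averaged over
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
```

validate() would construct its meters with the device it was handed, e.g. `AverageMeter("Loss", device=device)`, keeping every worker on its assigned GPU.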

After detaching the screen, `nvidia-smi` shows that the second command results in the expected, distributed GPU usage:

$ nvidia-smi
Sun Dec  1 01:01:17 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   56C    P0             94W /  400W |   14949MiB /  40960MiB |     62%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   49C    P0             99W /  400W |   15093MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   52C    P0            105W /  400W |   15097MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   55C    P0            103W /  400W |   15095MiB /  40960MiB |     92%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  |   00000000:0B:00.0 Off |                    0 |
| N/A   54C    P0            102W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  |   00000000:0C:00.0 Off |                    0 |
| N/A   50C    P0            103W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  |   00000000:0D:00.0 Off |                    0 |
| N/A   47C    P0             97W /  400W |   15095MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  |   00000000:0E:00.0 Off |                    0 |
| N/A   49C    P0             85W /  400W |   14953MiB /  40960MiB |     73%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2800326      C   /usr/bin/python                             14940MiB |
|    1   N/A  N/A   2800327      C   /usr/bin/python                             15084MiB |
|    2   N/A  N/A   2800328      C   /usr/bin/python                             15088MiB |
|    3   N/A  N/A   2800329      C   /usr/bin/python                             15084MiB |
|    4   N/A  N/A   2800330      C   /usr/bin/python                             15084MiB |
|    5   N/A  N/A   2800331      C   /usr/bin/python                             15084MiB |
|    6   N/A  N/A   2800332      C   /usr/bin/python                             15084MiB |
|    7   N/A  N/A   2800333      C   /usr/bin/python                             14944MiB |
+-----------------------------------------------------------------------------------------+

EIFY, Dec 01 '24