Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues

Open VibhuJawa opened this issue 7 months ago • 0 comments

Description

This PR fixes https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring that CUDA contexts are always spread across multiple GPUs.

Local test to verify this:

#!/usr/bin/env python3
"""
Test that cluster creation fails when GPUs are oversubscribed, i.e. when we deviate from the one-GPU-per-worker model.
"""
import time
from nemo_curator.pii.algorithm import PiiDeidentifier
from nemo_curator import get_client, __version__ as curator_version

def main() -> None:
    print(f"NeMo Curator version: {curator_version}")
    client = get_client(cluster_type="gpu", rmm_pool_size="1GB")
    time.sleep(3)  # give workers time to register

if __name__ == "__main__":
    main()

With PR:

NeMo Curator version: 0.9.0rc0.dev0
cuDF Spilling is enabled
Traceback (most recent call last):
  File "/home/nfs/vjawa/NeMo-Curator/tests/test_cluster_cuda.py", line 15, in <module>
    main()
  File "/home/nfs/vjawa/NeMo-Curator/tests/test_cluster_cuda.py", line 11, in main
    client = get_client(cluster_type="gpu", rmm_pool_size="1GB")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nfs/vjawa/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 344, in get_client
    _assert_unique_gpu_per_host(client)
  File "/home/nfs/vjawa/NeMo-Curator/nemo_curator/utils/distributed_utils.py", line 147, in _assert_unique_gpu_per_host
    raise RuntimeError(report)
RuntimeError: Duplicate GPU assignment detected!

Host: dgx11  (total workers: 8)
  GPU 0 → 8 workers
Each worker on a host must own a distinct GPU.
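
The body of `_assert_unique_gpu_per_host` is not shown in this description. As a rough illustration only, the sketch below shows one way a check of this kind could build a report like the one above. The function name `assert_unique_gpu_per_host`, its `worker_gpus` argument, and the `(host, gpu_id)` input shape are hypothetical and are not taken from the NeMo Curator code; how the per-worker GPU index is collected from the cluster is deliberately left out.

```python
from collections import Counter, defaultdict


def assert_unique_gpu_per_host(worker_gpus):
    """Raise RuntimeError if two workers on the same host share a GPU.

    worker_gpus maps a worker name to a (host, gpu_id) tuple. Both the
    function name and this input shape are hypothetical.
    """
    # Count how many workers landed on each GPU, grouped by host.
    per_host = defaultdict(Counter)
    for host, gpu_id in worker_gpus.values():
        per_host[host][gpu_id] += 1

    lines = []
    for host, counts in sorted(per_host.items()):
        duplicated = {gpu: n for gpu, n in counts.items() if n > 1}
        if not duplicated:
            continue
        lines.append(f"Host: {host}  (total workers: {sum(counts.values())})")
        for gpu, n in sorted(duplicated.items()):
            lines.append(f"  GPU {gpu} → {n} workers")

    if lines:
        report = "\n".join(
            ["Duplicate GPU assignment detected!", ""]
            + lines
            + ["Each worker on a host must own a distinct GPU."]
        )
        raise RuntimeError(report)


# Eight workers on distinct GPUs pass silently.
assert_unique_gpu_per_host({f"worker-{i}": ("dgx11", i) for i in range(8)})
```

Passing eight workers that all report GPU 0 on `dgx11` would raise a `RuntimeError` whose message has the same shape as the report in the traceback above.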

Without PR (no errors or warnings are raised):


NeMo Curator version: 0.9.0rc0.dev0
cuDF Spilling is enabled

But nvidia-smi shows that all eight worker processes are on GPU 0:

vjawa@dgx11:~/NeMo-Curator$ nvidia-smi
Fri Apr 25 10:55:19 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0              57W / 300W |  10133MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0              41W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB           On  | 00000000:86:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB           On  | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB           On  | 00000000:8A:00.0 Off |                    0 |
| N/A   30C    P0              43W / 300W |      3MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   2093974      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093976      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093982      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093984      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093989      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093995      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2093997      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
|    0   N/A  N/A   2094001      C   ...o_curator_nightly_25_04/bin/python3     1266MiB |
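
To confirm the oversubscription without reading the table by eye, the process rows can be tallied per GPU. This is a minimal sketch that assumes the text layout of the `nvidia-smi` process table shown above; `PROCESS_ROW` and `processes_per_gpu` are names made up for this example, and the pattern may need adjusting for other driver versions.

```python
import re
from collections import Counter

# Matches process rows such as
# "|    0   N/A  N/A   2093974      C   ...python3     1266MiB |"
# Columns: GPU index, GI ID, CI ID, PID, type, process name, memory usage.
PROCESS_ROW = re.compile(r"^\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+\S+\s+.*MiB\s+\|$")


def processes_per_gpu(nvidia_smi_text):
    """Return a Counter mapping GPU index to number of compute processes."""
    counts = Counter()
    for line in nvidia_smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Applied to the output above, any GPU index with a count greater than one points at workers sharing a device.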

VibhuJawa, Apr 18 '25 20:04