NeMo-Curator
Check that the PyTorch CUDA context is valid across GPUs
Describe the bug
We have had multiple breakages where the CUDA context is created only on GPU 0 in a Dask + PyTorch environment. Sometimes this happens because a library creates a CUDA context through PyTorch before the cluster is started.
As a result, PyTorch models end up deployed on GPU 0, and the problem is hard to debug.
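
A minimal sketch of what such a check could look like, not an existing NeMo-Curator API. The helper names `torch_cuda_already_initialized`, `first_visible_device`, and `find_shared_gpus` are hypothetical. It assumes workers are launched so that each one sees a different first entry in `CUDA_VISIBLE_DEVICES`, and it uses `torch.cuda.is_initialized()` to detect a context created before the cluster starts.

```python
import os
from collections import defaultdict


def torch_cuda_already_initialized():
    """Return True if PyTorch has already initialized CUDA in this process."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, so it cannot hold a context
    return torch.cuda.is_initialized()


def first_visible_device(env=None):
    """Return the first entry of CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return value.split(",")[0].strip() or None


def find_shared_gpus(worker_to_gpu):
    """Return {gpu: [workers]} for every GPU claimed by more than one worker."""
    by_gpu = defaultdict(list)
    for worker, gpu in worker_to_gpu.items():
        by_gpu[gpu].append(worker)
    return {gpu: sorted(ws) for gpu, ws in by_gpu.items() if len(ws) > 1}


# Hypothetical pre-flight check, run in the client process before the
# cluster is created.
if torch_cuda_already_initialized():
    print("Warning: PyTorch created a CUDA context before the cluster started")
```

Once the cluster is up, something like `find_shared_gpus(client.run(first_visible_device))` could flag workers that share a GPU, since `Client.run` returns a dict keyed by worker address. This only verifies the worker-to-GPU assignment; it does not prove where each PyTorch context actually lives.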