Check that the PyTorch CUDA context is valid across GPUs

Open VibhuJawa opened this issue 1 year ago • 3 comments

Describe the bug

We have had multiple breakages where the CUDA context ends up being used only on GPU 0 in a Dask + PyTorch environment. This can happen when a library creates a CUDA context through PyTorch before the cluster is started.

The result is that PyTorch models all get deployed on GPU 0, and the problem is hard to debug.
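A rough sketch of what such a check could look like, not an existing NeMo-Curator API: each worker reports the GPU that PyTorch would actually use, and a small helper flags any GPU claimed by more than one worker. The names `report_gpu` and `find_shared_gpus` are hypothetical. The `uuid` device property is assumed to exist only in recent PyTorch releases, so the sketch falls back to the worker's `CUDA_VISIBLE_DEVICES` entry, which is only an approximation if CUDA was initialized before that variable was set.

```python
import os
from collections import defaultdict


def report_gpu():
    """Run on a Dask worker: identify the GPU PyTorch would actually use."""
    import torch  # lazy import, so the checker below works without PyTorch

    index = torch.cuda.current_device()  # note: this initializes CUDA on the worker
    props = torch.cuda.get_device_properties(index)
    uuid = getattr(props, "uuid", None)  # assumption: only present in recent PyTorch
    if uuid is not None:
        return str(uuid)
    # Fallback: map the logical index onto this worker's CUDA_VISIBLE_DEVICES.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    entries = [e.strip() for e in visible.split(",") if e.strip()]
    return entries[index] if index < len(entries) else f"logical:{index}"


def find_shared_gpus(worker_gpus):
    """Return {gpu_id: [workers]} for every GPU claimed by more than one worker."""
    by_gpu = defaultdict(list)
    for worker, gpu in worker_gpus.items():
        by_gpu[gpu].append(worker)
    return {gpu: sorted(ws) for gpu, ws in by_gpu.items() if len(ws) > 1}


# Hypothetical usage with an existing dask.distributed Client named `client`:
#   shared = find_shared_gpus(client.run(report_gpu))
#   assert not shared, f"workers share a GPU: {shared}"
```

`Client.run` returns a dict keyed by worker address, which is the shape `find_shared_gpus` expects. A complementary guard would be to assert `not torch.cuda.is_initialized()` in the client process right before the cluster is created.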

VibhuJawa, Oct 08 '24 18:10