jglaser

Results 16 comments of jglaser

I do not need this functionality anymore, so I will only be able to provide limited guidance. Feel free to close if no one else needs this.

I am working on something related and will publish it as a PR soon.

@jcrist I am sorry if I sounded dismissive; I did not mean to discourage you. This is a much-needed project with a well-thought-out design. The only issue...

PR #21 fixes a few (last?) outstanding bugs in the calculation.

Bill, could you specify which function requires the mutex? Yes, the number in front of the ":" is the global process rank. Eight processes per node are calling the `rsmi_init()`...

Ping. Is anyone seeing this? Do you need more context?

Hi... RCCL uses rocm_smi under the hood (https://github.com/ROCmSoftwarePlatform/rccl/blob/4643a17f83900dd84676fc61ebf03be0d9584d68/src/misc/rocm_smi_wrap.cc#L37-L43). PyTorch uses RCCL for distributed training and instantiates multiple processes per node when there are multiple GPUs in a node. This leads...
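
To make the failure mode concrete, here is a minimal sketch of that access pattern, assuming rocm_smi_lib on a POSIX system (compile with e.g. `g++ repro.cpp -lrocm_smi64`); `NUM_PROCS` stands in for the per-node GPU count and is not taken from the actual launch:

```cpp
#include <cstdio>
#include <cstdlib>
#include <sys/wait.h>
#include <unistd.h>
#include "rocm_smi/rocm_smi.h"

constexpr int NUM_PROCS = 8; // eight ranks per node, as in the report

int main() {
    for (int i = 0; i < NUM_PROCS; ++i) {
        if (fork() == 0) {
            // Child process: concurrent rsmi_init(), as each RCCL rank would do.
            rsmi_status_t st = rsmi_init(0);
            std::printf("%d: rsmi_init -> %d\n", i, static_cast<int>(st));
            if (st == RSMI_STATUS_SUCCESS) rsmi_shut_down();
            _exit(st == RSMI_STATUS_SUCCESS ? EXIT_SUCCESS : EXIT_FAILURE);
        }
    }
    while (wait(nullptr) > 0) {} // reap all children
    return 0;
}
```

Running this on a multi-GPU node should exercise the same concurrent-initialization path that the PyTorch/RCCL launch hits.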

> Thanks for looking into this. Although multiple clients can access rocm_smi_lib at the same time, some functions only allow access from one process at a time. The shared...
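
If some entry points really do require single-process access, one conceivable node-local workaround, not anything rocm_smi_lib itself provides, is to guard `rsmi_init()` with an advisory file lock; the helper name and lock-file path below are made up for illustration:

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include "rocm_smi/rocm_smi.h"

// Hypothetical helper: serialize rsmi_init() across processes on one node.
rsmi_status_t rsmi_init_serialized(uint64_t flags) {
    int fd = open("/tmp/rsmi_init.lock", O_CREAT | O_RDWR, 0666);
    if (fd < 0) return RSMI_STATUS_FILE_ERROR;
    flock(fd, LOCK_EX);                  // block until we hold the node-wide lock
    rsmi_status_t st = rsmi_init(flags); // only one process runs this at a time
    flock(fd, LOCK_UN);
    close(fd);
    return st;
}

int main() {
    rsmi_status_t st = rsmi_init_serialized(0);
    std::printf("rsmi_init (serialized) -> %d\n", static_cast<int>(st));
    if (st == RSMI_STATUS_SUCCESS) rsmi_shut_down();
    return 0;
}
```

This only serializes initialization; any call that must be exclusive for its whole duration would need to hold the lock across that call instead.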

Has there been any progress on this issue? The problem is still present in ROCm 5.0.2 when launching PyTorch with 8 GPUs/node on OLCF Crusher.

```
15: pthread_mutex_unlock: Success
15: ...
```

Obviously, a GPU kernel to multiply the particle forces and reduce the CV will have to be implemented as well.
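
Purely as an illustration of the shape such a kernel could take, here is a sketch in CUDA syntax (the HIP version for AMD GPUs is nearly identical); the kernel name, the per-particle `cv_weight` array, and the single-component force layout are placeholders rather than the project's actual design:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scale each particle force by a per-particle CV
// weight and reduce the contributions into a single CV value.
__global__ void scale_forces_and_reduce_cv(float* forces,          // per-particle force (one component, for brevity)
                                           const float* cv_weight, // per-particle dCV/dr weight
                                           float* d_cv,            // device scalar accumulating the CV
                                           int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float contrib = forces[i] * cv_weight[i];
    forces[i] = contrib;      // write back the scaled force
    atomicAdd(d_cv, contrib); // global reduction into the CV
}

int main() {
    const int n = 1024;
    float *forces, *weights, *cv;
    cudaMallocManaged(&forces, n * sizeof(float));
    cudaMallocManaged(&weights, n * sizeof(float));
    cudaMallocManaged(&cv, sizeof(float));
    for (int i = 0; i < n; ++i) { forces[i] = 1.0f; weights[i] = 0.5f; }
    *cv = 0.0f;
    scale_forces_and_reduce_cv<<<(n + 255) / 256, 256>>>(forces, weights, cv, n);
    cudaDeviceSynchronize();
    std::printf("CV = %f\n", *cv); // expect 512 for these inputs
    cudaFree(forces); cudaFree(weights); cudaFree(cv);
    return 0;
}
```

A production kernel would likely use a block-level shared-memory reduction instead of a per-thread `atomicAdd`, but the atomic keeps the sketch short.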