jglaser
I do not need this functionality anymore, so I will only be able to provide limited guidance. Feel free to close if no one else needs this.
I am working on related stuff and will publish it as a PR soon.
@jcrist I am sorry if I sounded dismissive; I did not mean to discourage you. This is a much-needed project with a well-thought-out design. The only issue...
PR #21 fixes a few (last?) outstanding bugs in the calculation
Bill, could you specify which function requires the mutex? Yes, the number in front of the ":" is the global process rank. Eight processes per node are calling `rsmi_init()`...
ping. Is anyone seeing this? Do you need more context?
Hi... RCCL uses rocm_smi under the hood (https://github.com/ROCmSoftwarePlatform/rccl/blob/4643a17f83900dd84676fc61ebf03be0d9584d68/src/misc/rocm_smi_wrap.cc#L37-L43). PyTorch uses RCCL for distributed training and instantiates multiple processes per node when a node has multiple GPUs. This leads...
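For reference, here is a minimal multi-process sketch of that usage pattern (the fork loop and the counts are made up for illustration; `rsmi_init()` / `rsmi_shut_down()` are the actual rocm_smi entry points):

```cpp
// Sketch: several processes on the same node initialize and tear down
// rocm_smi concurrently, as RCCL does when PyTorch launches one rank per GPU.
// Link with -lrocm_smi64.
#include <rocm_smi/rocm_smi.h>

#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

int main() {
    const int ranks_per_node = 8;  // e.g. one process per GPU/GCD (illustrative)

    for (int rank = 0; rank < ranks_per_node; ++rank) {
        pid_t pid = fork();
        if (pid == 0) {
            // Each "rank" opens and closes the library a few times in a row,
            // which exercises the shared mutex inside rocm_smi_lib.
            for (int iter = 0; iter < 10; ++iter) {
                rsmi_status_t ret = rsmi_init(0);
                if (ret != RSMI_STATUS_SUCCESS) {
                    std::fprintf(stderr, "%d: rsmi_init failed (%d)\n",
                                 rank, static_cast<int>(ret));
                    std::exit(1);
                }
                rsmi_shut_down();
            }
            std::exit(0);
        }
    }

    // Parent waits for all child processes.
    int status = 0;
    while (wait(&status) > 0) {}
    return 0;
}
```

Running this on a single node should hit the same concurrent-initialization path as the RCCL/PyTorch setup described above.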
> Thanks for looking into this. Although multiple clients can access rocm_smi_lib at the same time, some functions only allow access from one process at a time. The shared...
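For context, the serialization pattern the quote refers to (a mutex shared between processes) generally looks something like the sketch below; the segment name and struct layout are made up here, this is not rocm_smi_lib's actual code:

```cpp
// Sketch of a process-shared, robust pthread mutex living in POSIX shared
// memory. Build with -lpthread (and -lrt on older glibc).
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>

struct SharedLock {
    pthread_mutex_t mutex;
};

int main() {
    // Create (or attach to) a named shared-memory segment visible to all
    // processes on the node.
    int fd = shm_open("/example_smi_lock", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { std::perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(SharedLock)) != 0) { std::perror("ftruncate"); return 1; }

    auto* lock = static_cast<SharedLock*>(
        mmap(nullptr, sizeof(SharedLock), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));

    // Initialize the mutex as process-shared and robust, so a lock held by a
    // crashed process can be recovered instead of deadlocking everyone else.
    // (A real implementation would guard this so only the creating process
    // runs pthread_mutex_init.)
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&lock->mutex, &attr);

    int ret = pthread_mutex_lock(&lock->mutex);
    if (ret == EOWNERDEAD) {
        // Previous owner died while holding the lock; mark it consistent.
        pthread_mutex_consistent(&lock->mutex);
    }

    // ... serialized section: only one process at a time gets here ...

    pthread_mutex_unlock(&lock->mutex);
    munmap(lock, sizeof(SharedLock));
    close(fd);
    return 0;
}
```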
Has there been any progress on this issue? The problem is still present in ROCm 5.0.2 when launching PyTorch with 8 GPUs/node on OLCF Crusher.
```
15: pthread_mutex_unlock: Success
15: ...
```
Obviously, a GPU kernel to multiply the particle forces and reduce the CV will have to be implemented as well.
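For what it's worth, a rough sketch of what such a kernel might look like (HIP, with placeholder array names; the actual per-particle derivative `dcv_dr` and the bias factor depend on the CV being implemented):

```cpp
// Sketch only: scale per-particle forces by the chain-rule factor and reduce
// the per-particle CV contributions into a single value.
#include <hip/hip_runtime.h>

__global__ void apply_cv_force_and_reduce(const float3* dcv_dr,         // dCV/dr_i
                                          const float* cv_per_particle, // CV contribution of particle i
                                          float bias,                   // e.g. -dV/dCV
                                          float3* force,                // in/out particle forces
                                          float* cv_total,              // global CV sum (zeroed before launch)
                                          int n)
{
    extern __shared__ float scratch[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float local_cv = 0.0f;
    if (i < n) {
        // Bias force: F_i += bias * dCV/dr_i
        force[i].x += bias * dcv_dr[i].x;
        force[i].y += bias * dcv_dr[i].y;
        force[i].z += bias * dcv_dr[i].z;
        local_cv = cv_per_particle[i];
    }

    // Block-level tree reduction of the CV contributions.
    scratch[threadIdx.x] = local_cv;
    __syncthreads();
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            scratch[threadIdx.x] += scratch[threadIdx.x + offset];
        __syncthreads();
    }

    // One atomic per block accumulates into the global CV value.
    if (threadIdx.x == 0)
        atomicAdd(cv_total, scratch[0]);
}
```

The launch would need `blockDim.x * sizeof(float)` bytes of dynamic shared memory and a power-of-two block size for the tree reduction.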