Pak Markthub
@wenduwan We are working on the next release and will publish it soon. For now, you can check out the latest commit of the R2.3.1 branch.
@wenduwan Yes, it should. Let us know if you still see the issue.
Hi @hassanbabaie, I am not familiar with Kubernetes. Let me find out if someone might know the answer.
Hi @hassanbabaie, FYI, we have released a gdrdrv container image on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cloud-native/containers/gdrdrv. Running that image will automatically compile and install the gdrdrv driver on your system. It will also...
Hi @pandyamarut, based on your question, my guess is that your application does not use GDRCopy directly. Probably you want to confirm that a library (e.g., UCX, NCCL) is properly...
Hi @tylerjereddy, I suspect that the segfault is from somewhere in https://github.com/ofiwg/libfabric/blob/main/src/hmem_cuda_gdrcopy.c#L346-L380 or https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrapi.c#L387-L411. Can you use `gdb` to identify the exact line where this segfault is triggered? For GDRCopy,...
Hi @tylerjereddy, I reviewed the NVSHMEM libfabric transport code. It does not use GDRCopy with Slingshot -- at least in NVSHMEM 2.10.1. However, libfabric itself (not NVSHMEM libfabric transport)...
> any risk that some problems arise because I'm building a newer gdrcopy than the driver version available on the HPC machine? libgdrapi.so and gdrdrv (driver) are forward and backward...
Thank you @tylerjereddy. I suspect that you may be running into a race condition caused by multithreading. GDRCopy, especially libgdrapi.so, is not thread-safe. Anyway, I added a global lock to some...
Sorry, there was a leftover code block. I just removed it. Please try again. Note that this is not our final solution. It is just an ad hoc implementation to see...
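For readers following the thread-safety discussion above, here is a minimal sketch of the general "global lock" idea: every call into a library that is not thread-safe is serialized behind one process-wide mutex. This is not the actual patch from the thread. `unsafe_library_call`, `locked_library_call`, and `run_locked_demo` are hypothetical names, and the unsafe call is a stand-in that only increments a shared counter; no GDRCopy function is used.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical stand-in for a call into a library that is not thread-safe:
 * an unsynchronized read-modify-write on shared state. */
static long shared_counter = 0;

static void unsafe_library_call(void)
{
    shared_counter += 1;
}

/* One process-wide lock that serializes every call into the library. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static void locked_library_call(void)
{
    pthread_mutex_lock(&global_lock);
    unsafe_library_call();
    pthread_mutex_unlock(&global_lock);
}

struct worker_args {
    int iterations;
};

/* Each worker calls the wrapped function a fixed number of times. */
static void *worker(void *arg)
{
    const struct worker_args *args = arg;
    for (int i = 0; i < args->iterations; i++)
        locked_library_call();
    return NULL;
}

/* Runs num_threads workers and returns the final counter value,
 * or -1 on invalid input or if a thread could not be created. */
long run_locked_demo(int num_threads, int iterations)
{
    pthread_t threads[MAX_THREADS];
    struct worker_args args = { iterations };
    int started = 0;

    if (num_threads < 1 || num_threads > MAX_THREADS)
        return -1;

    shared_counter = 0;
    for (; started < num_threads; started++) {
        if (pthread_create(&threads[started], NULL, worker, &args) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return (started == num_threads) ? shared_counter : -1;
}
```

Calling `run_locked_demo(4, 10000)` from a program linked with `-pthread` should return the full count of 40000, because the mutex prevents lost updates. A single global lock trades concurrency for safety, which fits a diagnostic experiment better than a final fix.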