UCT/CUDA_IPC: Redesign peer accessible cache
What
Redesigned the peer accessible cache in the uct_cuda_ipc component. The existing implementation had several design flaws:
- uct_cuda_ipc_component stores a reference to the last created MD. If that MD is closed, the reference becomes a dangling pointer (root cause of the VASP application segfault).
- The peer accessible cache is not thread safe and fails with multiple contexts.
- A cache is stored in each cuda_ipc MD, but in fact only the last created MD is used for caching.
- Minor: lookup now uses a hash map instead of an array-based search.
Consideration: locking
- global cache, single lock (current impl)
- cache in TLS
- cache per UCT worker
UPD: I measured the lock overhead with ucx_perftest on 4-16 threads with different message sizes (64-4096). There is no visible performance drop when using a global lock for the peer accessibility cache.
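For illustration, here is a minimal sketch of the "global cache, single lock" option, not the actual uct_cuda_ipc code. All names (`peer_accessible`, `query_peer_access`, `cache_lock`, `CACHE_SLOTS`) are hypothetical, the hash map is a small fixed-size open-addressing table rather than whatever UCX uses, and the real accessibility check is replaced by a stub. The point is that the cache lives at process scope, so it holds no pointer to any MD, and every lookup or insert happens under one mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CACHE_SLOTS 64 /* assumed capacity; must match the 6-bit hash below */

typedef struct {
    int      used;       /* 0 = empty slot, 1 = holds a cached result */
    uint64_t key;        /* (local_dev << 32) | peer_dev */
    int      accessible; /* cached result of the accessibility check */
} cache_slot_t;

/* One process-wide cache guarded by one lock; it references no MD, so
 * closing an MD cannot leave a dangling pointer behind. */
static cache_slot_t    cache[CACHE_SLOTS];
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int             query_count; /* slow-path calls, only for illustration */

/* Hypothetical stand-in for the real (expensive) peer accessibility check. */
static int query_peer_access(uint32_t local_dev, uint32_t peer_dev)
{
    ++query_count;
    return local_dev != peer_dev;
}

static int peer_accessible(uint32_t local_dev, uint32_t peer_dev)
{
    uint64_t key = ((uint64_t)local_dev << 32) | peer_dev;
    /* Multiplicative hash; the top 6 bits give an index in 0..63. */
    uint32_t idx = (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> 58);
    unsigned probe;
    int      result;
    int      rc;

    rc = pthread_mutex_lock(&cache_lock);
    assert(rc == 0);
    (void)rc;

    for (probe = 0; probe < CACHE_SLOTS; ++probe) {
        cache_slot_t *slot = &cache[(idx + probe) % CACHE_SLOTS];

        if (slot->used && (slot->key == key)) {
            result = slot->accessible; /* cache hit */
            goto out;
        }
        if (!slot->used) {
            /* Cache miss: run the check once and remember the answer. */
            slot->accessible = query_peer_access(local_dev, peer_dev);
            slot->key        = key;
            slot->used       = 1;
            result           = slot->accessible;
            goto out;
        }
    }

    /* Table full: fall back to an uncached query. */
    result = query_peer_access(local_dev, peer_dev);
out:
    pthread_mutex_unlock(&cache_lock);
    return result;
}
```

In this sketch, repeated calls with the same device pair from any thread run the slow check once. The TLS and per-worker options would remove the lock at the cost of one cache, and one set of slow-path checks, per thread or per worker.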
Why?
Fix for https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=4705183 (RM#3955117)
/azp run
Azure Pipelines successfully started running 4 pipeline(s).