
UCT/CUDA_IPC: Redesign peer accessible cache

Open iyastreb opened this issue 1 year ago • 2 comments

What

Redesigned the peer accessible cache in the uct_cuda_ipc component. The existing implementation had several design flaws:

  • uct_cuda_ipc_component stores a reference to the last created MD. If that MD is closed, the reference becomes a dangling pointer (root cause of the VASP application segfault)
  • the peer accessible cache is not thread safe and fails when multiple contexts are used
  • a cache is stored in each cuda_ipc MD, but only the last created MD is actually used for caching
  • minor: improved lookup to use a hash map instead of an array-based search

Consideration: locking

  • global cache, single lock (current impl)
  • cache in TLS
  • cache per UCT worker

UPD: I measured the lock overhead with ucx_perftest on 4-16 threads with different message sizes (64-4096). There is no visible performance drop when using a global lock for the peer accessibility cache.
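
To make the "global cache, single lock" option concrete, here is a minimal sketch of a process-wide peer accessibility cache guarded by one mutex and backed by a small open-addressing hash table. It is not the code from this PR: every name in it (`peer_cache_get`, `probe_peer`, `PEER_CACHE_SLOTS`, the `uint64_t` peer key) is hypothetical, and `probe_peer` is a stand-in for the real, expensive accessibility check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define PEER_CACHE_SLOTS 64 /* hypothetical capacity, must be a power of two */

typedef enum {
    PEER_UNKNOWN = 0,   /* also marks an empty slot */
    PEER_ACCESSIBLE,
    PEER_INACCESSIBLE
} peer_state_t;

typedef struct {
    uint64_t     key;   /* hypothetical peer identifier */
    peer_state_t state;
} peer_slot_t;

/* One cache for the whole process, not tied to the lifetime of any MD. */
static peer_slot_t     peer_cache[PEER_CACHE_SLOTS];
static pthread_mutex_t peer_cache_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned        peer_probe_count; /* counts slow-path probes */

/* Stand-in for the real accessibility check (assumption: odd keys are
 * accessible). A real implementation would query the CUDA driver here. */
static peer_state_t probe_peer(uint64_t key)
{
    peer_probe_count++;
    return (key % 2) ? PEER_ACCESSIBLE : PEER_INACCESSIBLE;
}

/* Return the cached state for a peer, probing it on the first request.
 * Returns PEER_UNKNOWN only if the table is full and the key is absent. */
peer_state_t peer_cache_get(uint64_t key)
{
    peer_state_t state = PEER_UNKNOWN;
    unsigned     i, idx;

    pthread_mutex_lock(&peer_cache_lock);
    for (i = 0; i < PEER_CACHE_SLOTS; i++) {
        /* linear probing starting from the key's home slot */
        idx = (unsigned)((key + i) & (PEER_CACHE_SLOTS - 1));
        assert(idx < PEER_CACHE_SLOTS);
        if (peer_cache[idx].state == PEER_UNKNOWN) {
            /* empty slot: the key is not cached yet, so probe and insert */
            peer_cache[idx].key   = key;
            peer_cache[idx].state = probe_peer(key);
        }
        if (peer_cache[idx].key == key) {
            state = peer_cache[idx].state;
            break;
        }
    }
    pthread_mutex_unlock(&peer_cache_lock);
    return state;
}
```

Because the table is global and holds no pointer back to an MD, closing an MD cannot leave a dangling reference, and the single mutex serializes lookups from all threads and contexts. The probe runs under the lock in this sketch, which keeps the logic simple at the cost of blocking other threads during a first-time check.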

Why ?

Fix for https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=4705183 (RM#3955117)

iyastreb, Jun 26 '24 14:06

/azp run

iyastreb, Jun 27 '24 11:06

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines[bot], Jun 27 '24 11:06