ucx UCT/CUDA_IPC: Redesign peer accessible cache

What

Redesigned the peer accessible cache in the uct_cuda_ipc component. There were several design flaws in the existing implementation:

uct_cuda_ipc_component stores a reference to the last created MD. However, if that MD is closed, the reference becomes a dangling pointer (root cause of the VASP application segfault)
the peer accessible cache is not thread safe and fails when there are multiple contexts
a cache is stored in each cuda_ipc MD, but in fact only the last created MD is used for caching
minor: improved lookup to use a hash map instead of an array-based search

Consideration: locking

global cache, single lock (current impl)
cache in TLS
cache per UCT worker

UPD: I tested lock overhead with ucx_perftest on 4-16 threads with different message sizes (64-4096). There is no visible performance drop when using a global lock for the peer accessibility cache.
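
For reviewers who want a picture of the first option (global cache, single lock), here is a minimal sketch in plain C with pthreads. It is not the code in this PR: the names peer_key_t, peer_cache_is_accessible, peer_cache_cleanup and demo_probe are hypothetical, the hand-rolled chained hash table only stands in for whatever hash map the component uses, and the probe callback stands in for the real peer accessibility check. The point it illustrates is that the cache is process-wide and not owned by any MD, so closing an MD cannot leave a stale reference behind, and that one mutex covers both lookup and insert.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical key: a peer is identified by process id and device ordinal. */
typedef struct {
    uint64_t pid;
    int      dev;
} peer_key_t;

typedef struct peer_entry {
    peer_key_t         key;
    int                accessible;
    struct peer_entry *next;
} peer_entry_t;

#define PEER_CACHE_BUCKETS 64

/* One process-wide table guarded by one lock; no MD owns it. */
static peer_entry_t   *peer_cache[PEER_CACHE_BUCKETS];
static pthread_mutex_t peer_cache_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned peer_hash(const peer_key_t *key)
{
    uint64_t h = key->pid * 0x9E3779B97F4A7C15ull + (uint64_t)key->dev;
    return (unsigned)((h ^ (h >> 32)) % PEER_CACHE_BUCKETS);
}

/* Stand-in for the real (expensive) accessibility check. */
typedef int (*peer_probe_fn)(const peer_key_t *key);

/* Return the cached flag for the peer, calling probe only on a miss.
 * Simplification: the probe runs while the lock is held. */
int peer_cache_is_accessible(const peer_key_t *key, peer_probe_fn probe)
{
    peer_entry_t *e;
    unsigned idx;
    int result;

    assert(key != NULL && probe != NULL);
    idx = peer_hash(key);

    pthread_mutex_lock(&peer_cache_lock);
    for (e = peer_cache[idx]; e != NULL; e = e->next) {
        if (e->key.pid == key->pid && e->key.dev == key->dev) {
            break;
        }
    }
    if (e != NULL) {
        result = e->accessible;          /* hit */
    } else {
        result = probe(key);             /* miss: probe, then remember */
        e = malloc(sizeof(*e));
        if (e != NULL) {
            e->key          = *key;
            e->accessible   = result;
            e->next         = peer_cache[idx];
            peer_cache[idx] = e;
        }
    }
    pthread_mutex_unlock(&peer_cache_lock);
    return result;
}

/* Drop all entries, e.g. at component teardown. */
void peer_cache_cleanup(void)
{
    peer_entry_t *e, *next;
    int i;

    pthread_mutex_lock(&peer_cache_lock);
    for (i = 0; i < PEER_CACHE_BUCKETS; i++) {
        for (e = peer_cache[i]; e != NULL; e = next) {
            next = e->next;
            free(e);
        }
        peer_cache[i] = NULL;
    }
    pthread_mutex_unlock(&peer_cache_lock);
}

/* Hypothetical probe for demonstration: even device ordinals are "accessible". */
int demo_probe_calls = 0;

int demo_probe(const peer_key_t *key)
{
    demo_probe_calls++;
    return (key->dev % 2) == 0;
}
```

Calling peer_cache_is_accessible twice with the same key and demo_probe invokes the probe once; the second call is served from the table. The TLS and per-worker options would remove the mutex at the cost of repeating the probe once per thread or per worker.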

Why ?

Fix for https://nvbugswb.nvidia.com/NvBugs5/SWBug.aspx?bugid=4705183 (RM#3955117)

Jun 26 '24 14:06 iyastreb

/azp run

Jun 27 '24 11:06 iyastreb

Azure Pipelines successfully started running 4 pipeline(s).

Jun 27 '24 11:06 azure-pipelines[bot]