ucx
ucx copied to clipboard
GTEST/COMMON: Cache CUDA device BAR1 available size
What
Cache the obtained BAR1 available size.
Why ?
The nvmlInit_v2()
seems to fork()/wait()
which is causing issues when running under valgrind.
Internal link: 3886807
How ?
Only request available size once on a given run.
ASAN failures look very much related but they are not as they are also found on CI failures for #9870:
2024-05-15T18:12:48.0241690Z #1 0x7fcbb242d595 (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb)
2024-05-15T18:12:48.0242914Z #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138
ASAN failures look very much related but they are not as they are also found on CI failures for #9870:
2024-05-15T18:12:48.0241690Z #1 0x7fcbb242d595 (/usr/lib64/libnvidia-ml.so.1+0x11c595) 2024-05-15T18:12:48.0242359Z #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb) 2024-05-15T18:12:48.0242914Z #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138
What if we get the BAR1 size during startup, and not on-demand when running tests?
What if we get the BAR1 size during startup, and not on-demand when running tests?
If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.
Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.
What if we get the BAR1 size during startup, and not on-demand when running tests?
If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.
Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.
I mean to do it before tests are running. Maybe this way we can avoid calling nvmlInit() twice (without destroy) and avoid the ASAN error
addressed, but since other failure comes from uct_cuda_ipc_get_device_nvlinks()
, not from get bar1 test function, there is possibility that leak will persist.
re-added the leak suppression since we still see a leak detected, although all nvmlInit calls being matched with nvmlShutdown in codebase.
Just calling it once in gtest should be okay, unless you have some other cuda app running in parallel.
/azp run UCX PR
Azure Pipelines successfully started running 1 pipeline(s).