ucx GTEST/COMMON: Cache CUDA device BAR1 available size

What

Cache the obtained BAR1 available size.

Why ?

The nvmlInit_v2() seems to fork()/wait() which is causing issues when running under valgrind.

Internal link: 3886807

How ?

Only request available size once on a given run.

May 15 '24 16:05 tvegas1

ASAN failures look very much related but they are not as they are also found on CI failures for #9870:

2024-05-15T18:12:48.0241690Z     #1 0x7fcbb242d595  (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z     #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb)
2024-05-15T18:12:48.0242914Z     #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138

May 16 '24 07:05 tvegas1

ASAN failures look very much related but they are not as they are also found on CI failures for #9870:

2024-05-15T18:12:48.0241690Z     #1 0x7fcbb242d595  (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z     #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb)
2024-05-15T18:12:48.0242914Z     #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138

What if we get the BAR1 size during startup, and not on-demand when running tests?

May 16 '24 08:05 yosefe

What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.

Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.

May 16 '24 08:05 tvegas1

What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.

Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.

I mean to do it before tests are running. Maybe this way we can avoid calling nvmlInit() twice (without destroy) and avoid the ASAN error

May 16 '24 08:05 yosefe

addressed, but since other failure comes from uct_cuda_ipc_get_device_nvlinks(), not from get bar1 test function, there is possibility that leak will persist.

May 16 '24 09:05 tvegas1

re-added the leak suppression since we still see a leak detected, although all nvmlInit calls being matched with nvmlShutdown in codebase.

May 16 '24 12:05 tvegas1

Just calling it once in gtest should be okay, unless you have some other cuda app running in parallel.

May 18 '24 13:05 shamisp

/azp run UCX PR

May 19 '24 05:05 yosefe

Azure Pipelines successfully started running 1 pipeline(s).

May 19 '24 05:05 azure-pipelines[bot]

ucx ucx copied to clipboard

GTEST/COMMON: Cache CUDA device BAR1 available size

What

Why ?

How ?

ucx
ucx copied to clipboard