ucx icon indicating copy to clipboard operation
ucx copied to clipboard

GTEST/COMMON: Cache CUDA device BAR1 available size

Open tvegas1 opened this issue 9 months ago • 6 comments

What

Cache the obtained BAR1 available size.

Why ?

The nvmlInit_v2() seems to fork()/wait() which is causing issues when running under valgrind.

Internal link: 3886807

How ?

Only request available size once on a given run.

tvegas1 avatar May 15 '24 16:05 tvegas1

ASAN failures look very much related but they are not as they are also found on CI failures for #9870:

2024-05-15T18:12:48.0241690Z     #1 0x7fcbb242d595  (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z     #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb)
2024-05-15T18:12:48.0242914Z     #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138

tvegas1 avatar May 16 '24 07:05 tvegas1

ASAN failures look very much related but they are not as they are also found on CI failures for #9870:

2024-05-15T18:12:48.0241690Z     #1 0x7fcbb242d595  (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z     #2 0x7fcbb232bceb in nvmlInitWithFlags (/usr/lib64/libnvidia-ml.so.1+0x1aceb)
2024-05-15T18:12:48.0242914Z     #3 0x7fcbb525ae74 in uct_cuda_ipc_get_device_nvlinks /__w/1/s/contrib/../src/uct/cuda/cuda_ipc/cuda_ipc_iface.c:138

What if we get the BAR1 size during startup, and not on-demand when running tests?

yosefe avatar May 16 '24 08:05 yosefe

What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.

Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.

tvegas1 avatar May 16 '24 08:05 tvegas1

What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at test startup and not on-demand? It is possible but since we don't know when/if the data will be needed, I though keeping it like that.

Also putting it in some test init, for example, would still call it many times, whereas here with the static, it's only ever called once per gtest run at most.

I mean to do it before tests are running. Maybe this way we can avoid calling nvmlInit() twice (without destroy) and avoid the ASAN error

yosefe avatar May 16 '24 08:05 yosefe

addressed, but since other failure comes from uct_cuda_ipc_get_device_nvlinks(), not from get bar1 test function, there is possibility that leak will persist.

tvegas1 avatar May 16 '24 09:05 tvegas1

re-added the leak suppression since we still see a leak detected, although all nvmlInit calls being matched with nvmlShutdown in codebase.

tvegas1 avatar May 16 '24 12:05 tvegas1

Just calling it once in gtest should be okay, unless you have some other cuda app running in parallel.

shamisp avatar May 18 '24 13:05 shamisp

/azp run UCX PR

yosefe avatar May 19 '24 05:05 yosefe

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar May 19 '24 05:05 azure-pipelines[bot]