Thomas Vegas
Tried similar commands below and did not see any repro. Could you please try a later version? ``` ./examples/ucp_client_server -c am -i 10000 -s 10000 ./examples/ucp_client_server -c am -i 10000...
> the test seem to ends, no more prints but client not return both server and client 99% CPU I compile v.16x devel

That should be fixed by #9701.
merged #9701
Would the patch below help? @yosefe, @brminich, is the intent correct when proto is enabled? ```diff diff --git a/src/ucp/rma/rma_send.c b/src/ucp/rma/rma_send.c index 2e6d659..13ec60c 100644 --- a/src/ucp/rma/rma_send.c +++ b/src/ucp/rma/rma_send.c @@ -271,6 +271,11 @@...
It seems the perftest MAD failure could be related to this PR.
Assuming UCX allocates huge pages with the sysv transport, you could try:
- disable huge pages: `UCX_SYSV_HUGETLB_MODE=no`
- disable the sysv transport, e.g. `UCX_TLS=rc_x,tcp,self`

else if it is related to internal buffers...
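A minimal sketch of how the two workarounds above could be tried from a shell, one at a time. The wrapper function names are hypothetical, and `env` stands in for the real application so the snippet only shows which variables the process would see; substitute your own command.

```shell
# Option 1: keep the sysv transport but turn off huge pages for it.
# (run_no_hugetlb is a hypothetical helper name.)
run_no_hugetlb() {
    UCX_SYSV_HUGETLB_MODE=no "$@"
}

# Option 2: leave sysv out by listing the transports explicitly.
# (run_no_sysv is a hypothetical helper name.)
run_no_sysv() {
    UCX_TLS=rc_x,tcp,self "$@"
}

# `env` is a placeholder for the application; it prints the UCX_* settings
# that the wrapped command would inherit.
run_no_hugetlb env | grep '^UCX_'
run_no_sysv env | grep '^UCX_'
```

Setting the variable on the command line keeps the change scoped to a single run, which makes it easier to tell which of the two settings affects the failure.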
@yosefe, shall we allow non-huge pages allocation for `ucp_am_bufs`?
The ASAN failures look very much related, but they are not: the same failures also show up in the CI runs for #9870: ``` 2024-05-15T18:12:48.0241690Z #1 0x7fcbb242d595 (/usr/lib64/libnvidia-ml.so.1+0x11c595) 2024-05-15T18:12:48.0242359Z #2 0x7fcbb232bceb in nvmlInitWithFlags...
> What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at...
Addressed, but since the other failure comes from `uct_cuda_ipc_get_device_nvlinks()`, not from the get-BAR1 test function, there is a possibility that the leak will persist.