Thomas Vegas comments

Results: 22 comments of Thomas Vegas

ucp_client_server client not stop

I tried commands similar to the ones below and could not reproduce the issue. Could you please try a later version?
```
./examples/ucp_client_server -c am -i 10000 -s 10000
./examples/ucp_client_server -c am -i 10000...
```

ucp_client_server client not stop

> the test seem to ends, no more prints but client not return both server and client 99% CPU I compile v.16x devel

That should be fixed by #9701.

ucp_client_server client not stop

merged #9701

Flag UCP_OP_ATTR_FLAG_FORCE_IMM_CMPL is ignored by ucp_put_nb when UCX_PROTO_ENABLE=y

Would the patch below help? @yosefe, @brminich, is this the intended behavior when `UCX_PROTO_ENABLE=y`?
```diff
diff --git a/src/ucp/rma/rma_send.c b/src/ucp/rma/rma_send.c
index 2e6d659..13ec60c 100644
--- a/src/ucp/rma/rma_send.c
+++ b/src/ucp/rma/rma_send.c
@@ -271,6 +271,11 @@...
```

UCP/RMA: Do not flush UCT EP if UCT iface is not active.

It seems the perftest MAD failure could be related to this PR.

UCX use a large amount of SYSV HugePage memory

Assuming UCX allocates the huge pages through the sysv transport, you could try:
- disabling huge pages: `UCX_SYSV_HUGETLB_MODE=no`
- disabling the sysv transport, e.g. `UCX_TLS=rc_x,tcp,self`

Otherwise, if it is related to internal buffers...
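As a minimal sketch of the two suggestions above, the variables can be exported in the shell that launches the application. The variable names and values are taken from the comment. Applying only the first option and printing the result with `env` is an illustrative choice, not something the comment prescribes.

```shell
# Option 1: keep the sysv transport but ask it not to use huge pages.
export UCX_SYSV_HUGETLB_MODE=no

# Option 2 (alternative, left commented out): restrict the transport list
# so that sysv is not selected at all.
# export UCX_TLS=rc_x,tcp,self

# Show the UCX_* variables that a process started from this shell inherits.
env | grep '^UCX_'
```

Start the application from the same shell afterwards so that it inherits the setting.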

UCX use a large amount of SYSV HugePage memory

@yosefe, shall we allow non-huge pages allocation for `ucp_am_bufs`?

GTEST/COMMON: Cache CUDA device BAR1 available size

The ASAN failures look very much related, but they are not: they also show up in the CI failures for #9870:
```
2024-05-15T18:12:48.0241690Z #1 0x7fcbb242d595 (/usr/lib64/libnvidia-ml.so.1+0x11c595)
2024-05-15T18:12:48.0242359Z #2 0x7fcbb232bceb in nvmlInitWithFlags...
```

GTEST/COMMON: Cache CUDA device BAR1 available size

> What if we get the BAR1 size during startup, and not on-demand when running tests?

If I get it right, you are suggesting to implement getting BAR1 size at...

GTEST/COMMON: Cache CUDA device BAR1 available size

Addressed, but since the other failure comes from `uct_cuda_ipc_get_device_nvlinks()`, not from the test function that gets the BAR1 size, there is a possibility that the leak will persist.