UCP/CORE: Use AM Bcopy for all ranges to not break reconfiguration
What
Use AM Bcopy for all ranges to not break reconfiguration.
Why ?
Fixes #8180.
How ?
- If CM_USE_ALL_DEVICES=y and initializing EP configuration during CM phase, set bcopy_thresh = 0, zcopy_thresh = SIZE_MAX, rndv_thresh = SIZE_MAX.
- Save thresholds to EP configuration.
- Take into account bcopy_thresh when setting max_short (currently, we use only zcopy_thresh and rndv_thresh).
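The threshold logic described above can be illustrated with a minimal, standalone sketch. It is not the UCX implementation: the struct, the function names (all prefixed with sketch_), and the convention that -1 means "short disabled" are assumptions made for illustration. Only the threshold names and the values 0 / SIZE_MAX come from the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three thresholds named in the description. */
typedef struct {
    size_t bcopy_thresh; /* messages of at least this size use bcopy */
    size_t zcopy_thresh; /* messages of at least this size use zcopy */
    size_t rndv_thresh;  /* messages of at least this size use rendezvous */
} sketch_thresholds_t;

/* Thresholds for an EP configured during the CM phase with
 * CM_USE_ALL_DEVICES=y: bcopy for every size, zcopy and rendezvous never. */
static sketch_thresholds_t sketch_cm_phase_thresholds(void)
{
    sketch_thresholds_t t;

    t.bcopy_thresh = 0;
    t.zcopy_thresh = SIZE_MAX;
    t.rndv_thresh  = SIZE_MAX;
    return t;
}

static size_t sketch_min(size_t a, size_t b)
{
    return (a < b) ? a : b;
}

/* Largest payload allowed to use the short protocol, or -1 if short is
 * disabled (assumed convention). lane_max_short is the transport limit.
 * The lowest of the three thresholds bounds short from above, so
 * bcopy_thresh is considered together with zcopy_thresh and rndv_thresh. */
static ptrdiff_t sketch_max_short(size_t lane_max_short,
                                  const sketch_thresholds_t *t)
{
    size_t limit = sketch_min(t->bcopy_thresh,
                              sketch_min(t->zcopy_thresh, t->rndv_thresh));
    size_t max_short;

    if (limit == 0) {
        return -1; /* every message size already belongs to another protocol */
    }

    max_short = sketch_min(lane_max_short, limit - 1);
    return (ptrdiff_t)sketch_min(max_short, (size_t)PTRDIFF_MAX);
}
```

Under these assumptions, sketch_max_short(100, &t) with the CM-phase thresholds yields -1, so a 4104-byte payload like the one in the log below would never be routed to am_short on a transport whose short limit is 100.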
reproduced the issue in the gtests, e.g.:
[ RUN ] shm_tcp/test_ucp_sockaddr_protocols.stream_bcopy_4k_exp/0 <shm,tcp,cuda_copy,rocm_copy/all_features>
[New Thread 0x7ffff2582700 (LWP 59919)]
[Thread 0x7ffff2582700 (LWP 59919) exited]
[New Thread 0x7ffff2582700 (LWP 59920)]
[Thread 0x7ffff2582700 (LWP 59920) exited]
[New Thread 0x7ffff2582700 (LWP 59921)]
[Thread 0x7ffff2582700 (LWP 59921) exited]
[New Thread 0x7ffff2582700 (LWP 59922)]
[ INFO ] server listening on 127.0.0.1:37502
[1652264424.446663] [swx-ucx02:59471:0] mm_ep.c:386 UCX ERROR Invalid am_short length: 4104 (expected: <= 100)
ucp/test_ucp_sockaddr.cc:279: Failure
Error: Invalid parameter
[swx-ucx02:59471:0:59471] ucp_worker.c:2781 Assertion `worker->inprogress++ == 0' failed
/hpc/mtr_scrap/users/dmitrygla/ucx/src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
...
2778 UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
2779
2780 /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==> 2781 ucs_assert(worker->inprogress++ == 0);
2782 count = uct_worker_progress(worker->uct);
2783 ucs_async_check_miss(&worker->async);
2784
==== backtrace (tid: 59471) ====
0 0x0000000000084f2c ucp_worker_progress() /hpc/mtr_scrap/users/dmitrygla/ucx/src/ucp/core/ucp_worker.c:2781
1 0x0000000000a8d7c4 ucp_test_base::entity::progress() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:1047
2 0x0000000000a882f2 ucp_test::progress() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:168
3 0x0000000000a886f7 ucp_test::check_events() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:249
4 0x0000000000a889bd ucp_test::request_process() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:291
5 0x0000000000a88b01 ucp_test::request_wait() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:311
6 0x0000000000a8842a ucp_test::flush_worker() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:195
7 0x0000000000a885de ucp_test::disconnect() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:222
8 0x0000000000a87d7a ucp_test::cleanup() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:86
9 0x0000000000594bed ucs::test_base::TearDownProxy() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/test.cc:324
10 0x00000000007b46b6 ucp_test::TearDown() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.h:202
11 0x0000000000e1ecf2 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2433
12 0x0000000000e1ab7c testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2469
13 0x0000000000e05958 testing::Test::Run() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2517
14 0x0000000000e06177 testing::TestInfo::Run() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2687
15 0x0000000000e06830 testing::TestSuite::Run() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2819
16 0x0000000000e11a2e testing::internal::UnitTestImpl::RunAllTests() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:5350
17 0x0000000000e1f9f5 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2433
18 0x0000000000e1ba20 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2469
19 0x0000000000e104fa testing::UnitTest::Run() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:4940
20 0x000000000057c3ef RUN_ALL_TESTS() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.h:2473
21 0x000000000057c2eb main() /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/main.cc:106
22 0x00000000000223d5 __libc_start_main() ???:0
23 0x000000000057ba69 _start() ???:0
=================================
the test failure (hang in perf_envelope) seems relevant
I’ll check why, but I also need to fix the valgrind error caused by an incorrect DC CQ length (https://github.com/openucx/ucx/pull/8222)