UCP/CORE: Use AM Bcopy for all ranges to not break reconfiguration

Open dmitrygx opened this issue 3 years ago • 3 comments

What

Use AM Bcopy for all ranges to not break reconfiguration.

Why ?

Fixes #8180.

How ?

  1. If CM_USE_ALL_DEVICES=y and the EP configuration is being initialized during the CM phase, set bcopy_thresh = 0, zcopy_thresh = SIZE_MAX, and rndv_thresh = SIZE_MAX, so that AM Bcopy is selected for every message size.
  2. Save the thresholds in the EP configuration.
  3. Take bcopy_thresh into account when setting max_short (currently, only zcopy_thresh and rndv_thresh are used).
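
The three steps above can be illustrated with a minimal, standalone sketch. It is not the UCX implementation: every name prefixed with `sketch_` is hypothetical, and the sketch assumes that a threshold means "use this protocol for lengths greater than or equal to the threshold" and that a negative max_short means "short sends disabled".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the thresholds saved in the EP configuration. */
typedef struct {
    size_t bcopy_thresh;
    size_t zcopy_thresh;
    size_t rndv_thresh;
} sketch_thresholds_t;

typedef enum {
    SKETCH_PROTO_SHORT,
    SKETCH_PROTO_BCOPY,
    SKETCH_PROTO_ZCOPY,
    SKETCH_PROTO_RNDV
} sketch_proto_t;

/* Step 1: thresholds used while the EP is configured during the CM phase.
 * bcopy starts at 0 and zcopy/rndv are pushed out to SIZE_MAX. */
static sketch_thresholds_t sketch_cm_phase_thresholds(void)
{
    sketch_thresholds_t t = {0, SIZE_MAX, SIZE_MAX};
    return t;
}

/* Step 3: derive max_short from all three thresholds, not only from
 * zcopy_thresh and rndv_thresh. Returns -1 when no length may use short. */
static ptrdiff_t sketch_max_short(size_t transport_max_short,
                                  const sketch_thresholds_t *t)
{
    size_t limit;

    assert(t != NULL);

    limit = t->bcopy_thresh;
    if (t->zcopy_thresh < limit) {
        limit = t->zcopy_thresh;
    }
    if (t->rndv_thresh < limit) {
        limit = t->rndv_thresh;
    }
    if (limit == 0) {
        return -1; /* some protocol already starts at length 0 */
    }

    limit -= 1; /* largest length strictly below every threshold */
    if (transport_max_short < limit) {
        limit = transport_max_short;
    }
    if (limit > (size_t)PTRDIFF_MAX) {
        limit = (size_t)PTRDIFF_MAX;
    }
    return (ptrdiff_t)limit;
}

/* Pick a protocol for a message of the given length. */
static sketch_proto_t sketch_select_proto(size_t length, ptrdiff_t max_short,
                                          const sketch_thresholds_t *t)
{
    assert(t != NULL);

    if (length >= t->rndv_thresh) {
        return SKETCH_PROTO_RNDV;
    }
    if (length >= t->zcopy_thresh) {
        return SKETCH_PROTO_ZCOPY;
    }
    if ((max_short >= 0) && (length <= (size_t)max_short)) {
        return SKETCH_PROTO_SHORT;
    }
    return SKETCH_PROTO_BCOPY;
}
```

Under these assumptions, the CM-phase thresholds give max_short = -1, so neither an 8-byte nor a 4104-byte message is sent with the short protocol; both go through bcopy regardless of the short limit of whichever transport the lane ends up on.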

dmitrygx avatar May 04 '22 06:05 dmitrygx

reproduced the issue in the gtests, e.g.:

[ RUN      ] shm_tcp/test_ucp_sockaddr_protocols.stream_bcopy_4k_exp/0 <shm,tcp,cuda_copy,rocm_copy/all_features>
[New Thread 0x7ffff2582700 (LWP 59919)]
[Thread 0x7ffff2582700 (LWP 59919) exited]
[New Thread 0x7ffff2582700 (LWP 59920)]
[Thread 0x7ffff2582700 (LWP 59920) exited]
[New Thread 0x7ffff2582700 (LWP 59921)]
[Thread 0x7ffff2582700 (LWP 59921) exited]
[New Thread 0x7ffff2582700 (LWP 59922)]
[     INFO ] server listening on 127.0.0.1:37502
[1652264424.446663] [swx-ucx02:59471:0]           mm_ep.c:386  UCX  ERROR Invalid am_short length: 4104 (expected: <= 100)
ucp/test_ucp_sockaddr.cc:279: Failure
Error: Invalid parameter
[swx-ucx02:59471:0:59471]  ucp_worker.c:2781 Assertion `worker->inprogress++ == 0' failed

/hpc/mtr_scrap/users/dmitrygla/ucx/src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
      ...
     2778     UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
     2779
     2780     /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==>  2781     ucs_assert(worker->inprogress++ == 0);
     2782     count = uct_worker_progress(worker->uct);
     2783     ucs_async_check_miss(&worker->async);
     2784

==== backtrace (tid:  59471) ====
 0 0x0000000000084f2c ucp_worker_progress()  /hpc/mtr_scrap/users/dmitrygla/ucx/src/ucp/core/ucp_worker.c:2781
 1 0x0000000000a8d7c4 ucp_test_base::entity::progress()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:1047
 2 0x0000000000a882f2 ucp_test::progress()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:168
 3 0x0000000000a886f7 ucp_test::check_events()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:249
 4 0x0000000000a889bd ucp_test::request_process()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:291
 5 0x0000000000a88b01 ucp_test::request_wait()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:311
 6 0x0000000000a8842a ucp_test::flush_worker()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:195
 7 0x0000000000a885de ucp_test::disconnect()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:222
 8 0x0000000000a87d7a ucp_test::cleanup()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.cc:86
 9 0x0000000000594bed ucs::test_base::TearDownProxy()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/test.cc:324
10 0x00000000007b46b6 ucp_test::TearDown()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/ucp/ucp_test.h:202
11 0x0000000000e1ecf2 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2433
12 0x0000000000e1ab7c testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2469
13 0x0000000000e05958 testing::Test::Run()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2517
14 0x0000000000e06177 testing::TestInfo::Run()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2687
15 0x0000000000e06830 testing::TestSuite::Run()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2819
16 0x0000000000e11a2e testing::internal::UnitTestImpl::RunAllTests()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:5350
17 0x0000000000e1f9f5 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2433
18 0x0000000000e1ba20 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:2469
19 0x0000000000e104fa testing::UnitTest::Run()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.cc:4940
20 0x000000000057c3ef RUN_ALL_TESTS()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/googletest/gtest.h:2473
21 0x000000000057c2eb main()  /hpc/mtr_scrap/users/dmitrygla/ucx/test/gtest/common/main.cc:106
22 0x00000000000223d5 __libc_start_main()  ???:0
23 0x000000000057ba69 _start()  ???:0
=================================

dmitrygx avatar May 11 '22 10:05 dmitrygx

the test failure (hang in perf_envelope) seems relevant

yosefe avatar May 14 '22 10:05 yosefe

> the test failure (hang in perf_envelope) seems relevant

I’ll check why, but I also need to fix the valgrind error caused by an incorrect DC CQ length (https://github.com/openucx/ucx/pull/8222)

dmitrygx avatar May 14 '22 12:05 dmitrygx