UCS/TOPO: generate sys-dev index based on device entry position in sysfs
What
Use the entry position of a given device in /sys/bus/pci/devices
instead of the device iteration count as seen by the topo subsystem on a given process.
Why?
Hopefully this ensures that all processes see the same sys device index on a given system, irrespective of the order in which system devices are populated by each individual process. This yields a system-unique sys device index for any domain:BDF-addressed PCI device, so no exchange of system_device_t -> bus_id mappings is required to meaningfully use a remote sys_dev for the purposes of iface_estimate_perf.
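A minimal sketch of the idea, assuming a hypothetical helper (pci_sysfs_index is illustrative, not the actual UCX implementation): scan /sys/bus/pci/devices once, sort the entries so every process sees them in the same order, and use the matching entry's position as the device's sys index.

```c
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Skip the "." and ".." entries so only PCI BDF entries are counted. */
static int pci_entry_filter(const struct dirent *e)
{
    return e->d_name[0] != '.';
}

/* Hypothetical sketch: return the position of the given "domain:bus:dev.fn"
 * entry in /sys/bus/pci/devices, or -1 if not found. scandir() with
 * alphasort() gives every process the same ordering, so the returned index
 * is stable across processes regardless of device discovery order. */
static int pci_sysfs_index(const char *bdf)
{
    struct dirent **entries;
    int n, i, index = -1;

    n = scandir("/sys/bus/pci/devices", &entries, pci_entry_filter, alphasort);
    if (n < 0) {
        return -1;
    }

    for (i = 0; i < n; ++i) {
        if ((index < 0) && (strcmp(entries[i]->d_name, bdf) == 0)) {
            index = i;
        }
        free(entries[i]);
    }
    free(entries);
    return index;
}
```

Because the index is derived from the device's fixed position in sysfs rather than from per-process discovery order, two processes can compare a remote sys_dev against their local devices directly, which is what the "Why" above relies on.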
cc @yosefe
@yosefe I'm seeing this sort of common failure across tests:
2022-03-12T01:46:22.0775077Z [ RUN ] dcx/test_ucp_mmap.fixed/2 <dc_x,cuda_copy,rocm_copy/proto>
2022-03-12T01:46:22.4048345Z unknown file: Failure
2022-03-12T01:46:22.4049125Z C++ exception with description "basic_string::_S_construct null not valid" thrown in the test body.
2022-03-12T01:46:22.5224670Z [1647049582.521985] [swx-rdmz-ucx-new-02:2072 :0] rcache.c:674 UCX WARN ucp rcache: destroying inuse region 0x3ac6020 [0xff0000000..0xff0001000] g- rw ref 1 md[0]=mlx5_0 md[1]=mlx5_2 md[2]=mlx5_3
2022-03-12T01:46:22.5227079Z [swx-rdmz-ucx-new-02:2072 :0:2072] rcache.c:410 Assertion `region->refcount == 0' failed: region 0x3ac6020 0xff0000000..0xff0001000 of ucp rcache
2022-03-12T01:46:23.1726020Z
2022-03-12T01:46:23.1728574Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c: [ ucs_mem_region_destroy_internal() ]
2022-03-12T01:46:23.1729486Z ...
2022-03-12T01:46:23.1730030Z 406
2022-03-12T01:46:23.1730673Z 407 ucs_rcache_region_trace(rcache, region, "destroy");
2022-03-12T01:46:23.1731309Z 408
2022-03-12T01:46:23.1732215Z ==> 409 ucs_assertv(region->refcount == 0, "region %p 0x%lx..0x%lx of %s", region,
2022-03-12T01:46:23.1733325Z 410 region->super.start, region->super.end, rcache->name);
2022-03-12T01:46:23.1734369Z 411 ucs_assert(!(region->flags & UCS_RCACHE_REGION_FLAG_PGTABLE));
2022-03-12T01:46:23.1735109Z 412
2022-03-12T01:46:23.1735572Z
2022-03-12T01:46:23.8380266Z ==== backtrace (tid: 2072) ====
2022-03-12T01:46:23.8382006Z 0 0x000000000006c2b6 ucs_mem_region_destroy_internal() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:409
2022-03-12T01:46:23.8382960Z 1 0x000000000006c2b6 ucs_mem_region_destroy_internal() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:429
2022-03-12T01:46:23.8383792Z 2 0x000000000006ef91 ucs_rcache_purge() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:676
2022-03-12T01:46:23.8385072Z 3 0x000000000006ef91 ucs_rcache_t_cleanup() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:1334
2022-03-12T01:46:23.8385900Z 4 0x000000000007d04e ucs_class_call_cleanup_chain() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/type/class.c:56
2022-03-12T01:46:23.8386705Z 5 0x000000000006f690 ucs_rcache_destroy() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:1358
2022-03-12T01:46:23.8387476Z 6 0x0000000000024f1f ucp_cleanup() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_context.c:1919
2022-03-12T01:46:23.8388322Z 7 0x0000000000024f1f ucp_cleanup() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_context.c:1920
2022-03-12T01:46:23.8389241Z 8 0x0000000000a130c8 ucs::handle<ucp_context*, void*>::release() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:725
2022-03-12T01:46:23.8390386Z 9 0x0000000000a130c8 ucs::handle<ucp_context*, void*>::reset() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:660
2022-03-12T01:46:23.8391268Z 10 0x0000000000a130c8 ~handle() /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:655
2022-03-12T01:46:23.8392117Z 11 0x0000000000a130c8 ucp_test_base::entity::
As this PR touches sys_dev indices, can you suggest whether the rcache path should see any effects? The most notable change is that sys_dev indices are no longer assigned contiguously starting from 0; instead there would be an assorted set of entries between 0 and 255. Do you see any immediate issues? I'm unable to reproduce the above error on local machines, but I do see the following failures instead (see the sketch after the gtest output below):
./test/gtest/gtest --gtest_filter=*rc/test_ucp_mmap.reg_mem_type*
...
The difference between dist1.bandwidth and dist2.bandwidth is 1.7976931348623157e+308, which exceeds 600e6, where
dist1.bandwidth evaluates to 2199023255552,
dist2.bandwidth evaluates to 1.7976931348623157e+308, and
600e6 evaluates to 600000000.
../../../test/gtest/ucp/test_ucp_mmap.cc:261: Failure
The difference between dist1.bandwidth and dist2.bandwidth is 1.7976931348623157e+308, which exceeds 600e6, where
dist1.bandwidth evaluates to 2199023255552,
dist2.bandwidth evaluates to 1.7976931348623157e+308, and
600e6 evaluates to 600000000.
[ FAILED ] rc/test_ucp_mmap.reg_mem_type/2, where GetParam() = rc_v,cuda_copy,rocm_copy/proto (5870 ms)
[----------] 3 tests from rc/test_ucp_mmap (17271 ms total)
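A hedged side note on the local failure: 1.7976931348623157e+308 is DBL_MAX, which would suggest the distance lookup for the remapped sys_dev returned an "unknown" sentinel rather than a real bandwidth. The sketch below (hypothetical names, not UCX code) shows how a consumer that still assumes dense 0..N-1 indices could hit exactly that once indices become sparse sysfs positions:

```c
#include <float.h>
#include <stdio.h>

#define MAX_DEVS 256

/* Hypothetical distance table: only indices that were actually assigned get
 * a bandwidth. With the old scheme indices were dense 0..N-1; with
 * sysfs-position indices they are sparse within [0, 255]. */
static double bandwidth_by_sys_dev[MAX_DEVS]; /* 0.0 == never populated */

static double lookup_bandwidth(unsigned sys_dev)
{
    if ((sys_dev >= MAX_DEVS) || (bandwidth_by_sys_dev[sys_dev] == 0.0)) {
        return DBL_MAX; /* sentinel for "unknown distance" */
    }
    return bandwidth_by_sys_dev[sys_dev];
}

int main(void)
{
    /* Table populated with the old dense index (3rd discovered device -> 2). */
    bandwidth_by_sys_dev[2] = 2199023255552.0;

    /* After the change, a reader that queries with the device's sysfs
     * position (say, 17) hits an unpopulated slot and gets DBL_MAX --
     * the same bandwidth value reported in the gtest failure above. */
    printf("%g\n", lookup_bandwidth(17));
    return 0;
}
```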
@Akshay-Venkatesh I don't see an obvious reason; maybe try to narrow it down by changing the previous code to generate non-linear device IDs?
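A minimal sketch of that bisection idea (names are illustrative, not the actual UCX topo code): keep the previous count-based assignment but scatter the resulting IDs, so any breakage caused purely by non-contiguous indices should reproduce without the sysfs change.

```c
#include <stdint.h>

typedef uint8_t sys_device_id_t; /* illustrative stand-in for ucs_sys_device_t */

/* Hypothetical tweak to the previous, iteration-count based assignment:
 * map the running count 0,1,2,... to 0,37,74,... (mod 251), so IDs stay
 * unique for fewer than 251 devices but are no longer contiguous. If the
 * rcache failure reproduces with this alone, non-linear IDs (rather than
 * the sysfs-position logic) are the trigger. */
static sys_device_id_t next_sys_dev_id(void)
{
    static unsigned count = 0;
    return (sys_device_id_t)((count++ * 37) % 251); /* 251 is prime -> unique */
}
```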