UCS/TOPO: generate sys-dev index based on device entry position in sysfs

Open Akshay-Venkatesh opened this issue 2 years ago • 3 comments

What

Use the position of a device's entry in /sys/bus/pci/devices as its sys-dev index, instead of the device iteration count as seen by topo sys in the given process.

Why?

Hopefully, this ensures that all processes on a given system see the same sys device index for a given device, irrespective of the order in which each individual process populates its system devices. The sys device index then becomes system-unique for a PCI device addressed by domain:bdf, so no exchange of system_device_t -> bus_id mappings is required to meaningfully use a remote_sys_dev for the purposes of iface_estimate_perf.
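
To illustrate the idea, here is a minimal sketch of deriving an index from a device's entry position in a sysfs directory. It is not the code from this PR: the function names `pci_entry_position` and `skip_dot_entries` are hypothetical, and the use of `scandir()` with `alphasort` to get a stable, sorted order is an assumption made so that every process computes the same position regardless of how the filesystem happens to return entries.

```c
#define _DEFAULT_SOURCE /* exposes scandir() and alphasort() on glibc */
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Filter for scandir(): skip "." and ".." (and any other dot entries). */
static int skip_dot_entries(const struct dirent *entry)
{
    return entry->d_name[0] != '.';
}

/*
 * Hypothetical helper: return the zero-based position of bus_id (for
 * example "0000:3b:00.0") among the sorted entries of dir_path, or -1 if
 * the directory cannot be read or has no entry with that name.
 */
static int pci_entry_position(const char *dir_path, const char *bus_id)
{
    struct dirent **entries;
    int count, i;
    int position = -1;

    assert((dir_path != NULL) && (bus_id != NULL));

    /* Assumption: sorting makes the position identical in every process */
    count = scandir(dir_path, &entries, skip_dot_entries, alphasort);
    if (count < 0) {
        return -1;
    }

    for (i = 0; i < count; ++i) {
        if ((position < 0) && (strcmp(entries[i]->d_name, bus_id) == 0)) {
            position = i;
        }
        free(entries[i]); /* scandir() allocates each entry separately */
    }
    free(entries);

    return position;
}
```

A caller would pass "/sys/bus/pci/devices" as `dir_path` and the device's domain:bdf string as `bus_id`. Because the result depends only on the directory contents, two processes on the same system would get the same value for the same device, which is the property described above.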

Akshay-Venkatesh avatar Mar 11 '22 01:03 Akshay-Venkatesh

cc @yosefe

Akshay-Venkatesh avatar Mar 11 '22 01:03 Akshay-Venkatesh

@yosefe seeing this sort of common failure across tests:

2022-03-12T01:46:22.0775077Z [ RUN      ] dcx/test_ucp_mmap.fixed/2 <dc_x,cuda_copy,rocm_copy/proto>
2022-03-12T01:46:22.4048345Z unknown file: Failure
2022-03-12T01:46:22.4049125Z C++ exception with description "basic_string::_S_construct null not valid" thrown in the test body.
2022-03-12T01:46:22.5224670Z [1647049582.521985] [swx-rdmz-ucx-new-02:2072 :0]          rcache.c:674  UCX  WARN  ucp rcache: destroying inuse region 0x3ac6020 [0xff0000000..0xff0001000] g- rw ref 1  md[0]=mlx5_0 md[1]=mlx5_2 md[2]=mlx5_3
2022-03-12T01:46:22.5227079Z [swx-rdmz-ucx-new-02:2072 :0:2072]      rcache.c:410  Assertion `region->refcount == 0' failed: region 0x3ac6020 0xff0000000..0xff0001000 of ucp rcache
2022-03-12T01:46:23.1726020Z 
2022-03-12T01:46:23.1728574Z /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c: [ ucs_mem_region_destroy_internal() ]
2022-03-12T01:46:23.1729486Z       ...
2022-03-12T01:46:23.1730030Z       406 
2022-03-12T01:46:23.1730673Z       407     ucs_rcache_region_trace(rcache, region, "destroy");
2022-03-12T01:46:23.1731309Z       408 
2022-03-12T01:46:23.1732215Z ==>   409     ucs_assertv(region->refcount == 0, "region %p 0x%lx..0x%lx of %s", region,
2022-03-12T01:46:23.1733325Z       410                 region->super.start, region->super.end, rcache->name);
2022-03-12T01:46:23.1734369Z       411     ucs_assert(!(region->flags & UCS_RCACHE_REGION_FLAG_PGTABLE));
2022-03-12T01:46:23.1735109Z       412 
2022-03-12T01:46:23.1735572Z 
2022-03-12T01:46:23.8380266Z ==== backtrace (tid:   2072) ====
2022-03-12T01:46:23.8382006Z  0 0x000000000006c2b6 ucs_mem_region_destroy_internal()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:409
2022-03-12T01:46:23.8382960Z  1 0x000000000006c2b6 ucs_mem_region_destroy_internal()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:429
2022-03-12T01:46:23.8383792Z  2 0x000000000006ef91 ucs_rcache_purge()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:676
2022-03-12T01:46:23.8385072Z  3 0x000000000006ef91 ucs_rcache_t_cleanup()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:1334
2022-03-12T01:46:23.8385900Z  4 0x000000000007d04e ucs_class_call_cleanup_chain()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/type/class.c:56
2022-03-12T01:46:23.8386705Z  5 0x000000000006f690 ucs_rcache_destroy()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/memory/rcache.c:1358
2022-03-12T01:46:23.8387476Z  6 0x0000000000024f1f ucp_cleanup()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_context.c:1919
2022-03-12T01:46:23.8388322Z  7 0x0000000000024f1f ucp_cleanup()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucp/core/ucp_context.c:1920
2022-03-12T01:46:23.8389241Z  8 0x0000000000a130c8 ucs::handle<ucp_context*, void*>::release()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:725
2022-03-12T01:46:23.8390386Z  9 0x0000000000a130c8 ucs::handle<ucp_context*, void*>::reset()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:660
2022-03-12T01:46:23.8391268Z 10 0x0000000000a130c8 ~handle()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.h:655
2022-03-12T01:46:23.8392117Z 11 0x0000000000a130c8 ucp_test_base::entity::

As this PR touches sys_dev indices, can you suggest whether the rcache path should be affected? The most notable change is that sys_dev indices aren't assigned in order from 0 upward; instead there would be an assorted list of entries between 0 and 255. Do you see any immediate issues? I'm unable to reproduce the above error on local machines, but I do see the following failures instead:

./test/gtest/gtest --gtest_filter=*rc/test_ucp_mmap.reg_mem_type*
...
The difference between dist1.bandwidth and dist2.bandwidth is 1.7976931348623157e+308, which exceeds 600e6, where
dist1.bandwidth evaluates to 2199023255552,
dist2.bandwidth evaluates to 1.7976931348623157e+308, and
600e6 evaluates to 600000000.
../../../test/gtest/ucp/test_ucp_mmap.cc:261: Failure
The difference between dist1.bandwidth and dist2.bandwidth is 1.7976931348623157e+308, which exceeds 600e6, where
dist1.bandwidth evaluates to 2199023255552,
dist2.bandwidth evaluates to 1.7976931348623157e+308, and
600e6 evaluates to 600000000.
[  FAILED  ] rc/test_ucp_mmap.reg_mem_type/2, where GetParam() = rc_v,cuda_copy,rocm_copy/proto (5870 ms)
[----------] 3 tests from rc/test_ucp_mmap (17271 ms total)

Akshay-Venkatesh avatar Mar 12 '22 20:03 Akshay-Venkatesh

@Akshay-Venkatesh I don't see an obvious reason; maybe try to narrow it down, by changing previous code to generate non-linear device ids?

yosefe avatar Mar 13 '22 16:03 yosefe
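
One hypothetical way to try the narrowing-down suggested above is to keep the previous iteration-count scheme but pass each index through a fixed permutation, so that ids stay unique yet are no longer linear. The sketch below uses 8-bit bit reversal purely as an example; the names `scramble_sys_dev_index` and `check_scramble_is_permutation` are made up for illustration, and the sketch assumes indices fit in 8 bits. If the code reserves a sentinel value for an unknown device, the mapping would need to avoid it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical debugging aid: map a linear index (0, 1, 2, ...) to a
 * non-linear one by reversing its 8 bits, e.g. 1 -> 128, 2 -> 64, 3 -> 192.
 */
static uint8_t scramble_sys_dev_index(uint8_t linear_index)
{
    uint8_t reversed = 0;
    int bit;

    for (bit = 0; bit < 8; ++bit) {
        /* Shift the result left and append the next lowest input bit */
        reversed = (uint8_t)((reversed << 1) | ((linear_index >> bit) & 1));
    }

    return reversed;
}

/* Verify that the mapping is a permutation: no two inputs collide. */
static void check_scramble_is_permutation(void)
{
    uint8_t seen[256] = {0};
    int i;

    for (i = 0; i < 256; ++i) {
        uint8_t out = scramble_sys_dev_index((uint8_t)i);
        assert(!seen[out]);
        seen[out] = 1;
    }
}
```

If the failures above also appear with the old enumeration plus such a scramble, that would point at code which assumes dense, sequential sys_dev indices rather than at the sysfs lookup itself.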