ARCH/X86: Introduce non temporal buffer transfer
Processors based on AMD's Zen 3/4 architecture typically organize CPU cores into clusters of 8 or more cores. Within each cluster, these cores share a unified L3 cache, optimizing data transfer speeds among them. However, when data needs to travel between cores located in different clusters that do not share a common L3 cache, the transfer speed can be significantly reduced, potentially becoming 2-3 orders of magnitude slower.
To leverage this microarchitecture effectively, MPI applications often employ an 'n-ranks x m-threads' configuration, where:
n <= number of unified L3 caches in the system
m >= number of cores within a cluster
When UCX is used as the underlying data transfer mechanism, two ranks/processes in an intra-node scenario use either CICO (copy-in copy-out) or SCOPY (single copy) to move data between them, depending on the rendezvous threshold. This patch improves data transfer speed and application performance on AMD's Zen 3/4 architecture for both CICO and SCOPY when the sender and receiver processes are pinned to cores that don't share a common L3 cache.
a) CICO
```
     Sender                       Receiver
+-------------+               +------------+
+.............+               +............+
+    sbuf     +               +    rbuf    +
+.............+               +............+
+             +               +            +
+             +               +            +
+             +               +            +
+.............+>>>>>>>>>>>>>>>+............+
+   rshmem    +  POSIX/SYSV   +            +
+   mapped    +    mapping    +   rshmem   +
+    area     +               +            +
+.............+<<<<<<<<<<<<<<<+............+
+             +               +            +
+-------------+               +------------+
```
In the current method, the sender process first moves the data from the source buffer (sbuf) to the remote shared memory (rshmem) using glibc's memcpy and informs the receiver process about the presence of new data. The receiver process then copies the data from rshmem to its local buffer (rbuf), again using glibc's memcpy, thus completing the transfer.
In contrast, the new approach optimizes this by using distinct memory copy techniques. It utilizes 'nontemporal store' or 'store with PREFETCHNTA' when copying data from sbuf to rshmem and employs 'loads with PREFETCHNTA' when copying data from rshmem to rbuf. This optimization effectively reduces data transfer latency by circumventing cache-to-cache data transfers, which tend to be slower, especially when cores are situated in different clusters. Additionally, this method helps in minimizing cache pollution on both the sender and receiver sides, further enhancing overall performance.
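As a rough illustration of the sender-side idea only (not the code in this patch), a copy loop based on SSE2 streaming stores could look like the sketch below; alignment handling and tail cases are omitted, and the function name is hypothetical.
```c
/* Illustrative sketch: copy sbuf into the shared-memory segment with
 * non-temporal (streaming) stores so the written data bypasses the
 * sender's caches. Assumes a 16-byte-aligned dst and a length that is
 * a multiple of 16 bytes. */
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

static void copy_to_rshmem_nt(void *dst, const void *src, size_t len)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t         i;

    for (i = 0; i < len / sizeof(__m128i); i++) {
        /* regular load from sbuf, streaming store to rshmem */
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    }

    /* make the streamed stores globally visible before signalling the
     * receiver that new data is available */
    _mm_sfence();
}
```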
b) SCOPY (only xpmem is considered, as the other methods use memcpy inside the kernel)
```
     Sender                            Receiver
+-------------+               +--------------------+
+             +               +....................+
+             +               +        rbuf        +
+             +               +....................+
+             +               +                    +
+             +               +                    +
+             +               +                    +
+.............+>>>>>>>>>>>>>>>+....................+
+    sbuf     +               +   sbuf_xpmem_map   +
+.............+>>>>>>>>>>>>>>>+                    +
+             +               + physical pages of  +
+             +               + sbuf mapped to     +
+             +               + receiver's virtual +
+             +               + memory by fault    +
+             +               + handler in xpmem.ko+
+             +               +....................+
+             +               +                    +
+             +               +                    +
+-------------+               +--------------------+
```
The new method replaces glibc's memcpy of the receiver's data from sbuf_xpmem_map (src) to rbuf (dst) with 'loads with PREFETCHNTA'. This minimizes cache pollution inside the receiver's core (the sbuf pages are mapped with rw flags in the receiver process).
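A similarly simplified, hypothetical sketch of the receiver-side idea (again, not the actual UCX code): prefetch the mapped source with the NTA hint ahead of a regular copy. The block size, prefetch distance, and function name are illustrative assumptions.
```c
/* Illustrative sketch: copy from the xpmem-mapped source with NTA
 * prefetch hints so the source lines are evicted quickly and pollute
 * the receiver's caches as little as possible. */
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_NTA */

#define NT_BLOCK 64 /* one cache line per prefetch (illustrative) */

static void copy_from_mapped_src_nt(void *dst, const void *src, size_t len)
{
    const char *s = (const char *)src;
    char       *d = (char *)dst;
    size_t      off;

    for (off = 0; off < len; off += NT_BLOCK) {
        size_t chunk = (len - off < NT_BLOCK) ? (len - off) : NT_BLOCK;

        /* hint the next line as non-temporal; prefetching past the end
         * on the last iteration is architecturally harmless */
        _mm_prefetch(s + off + NT_BLOCK, _MM_HINT_NTA);
        memcpy(d + off, s + off, chunk);
    }
}
```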
Acked-by: Edgar Gabriel [email protected]
How to build
use "--enable-builtin-memcpy=no --enable-optimizations --enable-nt-buffer-transfer" while configuring
Eg: $./contrib/configure-release --prefix=/install_path --with-xpmem=/xpmem_path --enable-builtin-memcpy=no --enable-optimizations --enable-nt-buffer-transfer
How to run
During mpirun, nt-buffer-transfer is dynamically selected according to the table below.
Use "-x UCX_NT_BUFFER_TRANSFER_MIN=0" during mpirun for a workload that doesn't have more than one rank in the L3 domain.
Below is a comparison of latency when two ranks are in two different NUMA domains (from a Milan/Zen3 node).
cmd: mpirun -np 2 --map-by numa --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -x UCX_TLS=^ib -x UCX_NT_BUFFER_TRANSFER_MIN=0 ./osu_latency -m 1:268435456 -i 50000 -x 5000
Note: The default threshold values for selecting nt-buffer-transfer for an individual transfer are set at 3/4 the size of the L3 cache, to handle MPI bindings where multiple ranks share a common L3 cache. If the MPI workload has only one rank per L3, nt-buffer-transfer can be used for all lengths, and the user can enforce this by passing "-x UCX_NT_BUFFER_TRANSFER_MIN=0" on the command line.
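For clarity, a minimal sketch of the selection policy described above; the names (nt_threshold, nt_copy, select_copy) are hypothetical and not from the patch, only the policy (default threshold of 3/4 of the L3 size, user override such as UCX_NT_BUFFER_TRANSFER_MIN=0) follows the description.
```c
/* Illustrative sketch of the threshold-based selection described above. */
#include <stddef.h>
#include <string.h>

static void nt_copy(void *dst, const void *src, size_t len)
{
    /* placeholder for the non-temporal copy path */
    memcpy(dst, src, len);
}

static size_t default_nt_threshold(size_t l3_size)
{
    /* default: 3/4 of the L3 size, so ranks sharing an L3 keep using the
     * regular cached memcpy for smaller messages */
    return (l3_size / 4) * 3;
}

static void select_copy(void *dst, const void *src, size_t total_len,
                        size_t nt_threshold)
{
    if (total_len >= nt_threshold) {
        nt_copy(dst, src, total_len); /* large copies avoid cache pollution */
    } else {
        memcpy(dst, src, total_len);  /* small copies stay in cache */
    }
}
```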
--Arun
@yosefe I have re-designed it by adding the copy direction in ucs_memcpy_relaxed(). Please take a look.
Dynamic controls with variables similar to builtin_memcpy_min & builtin_memcpy_max have not yet been implemented; I will add them soon. --Arun
pls fix the commit title, see CI failure (would need force push)
Done. I have force-pushed after changing the commit titles of all the commits in this series.
@yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
> @yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
This is the error:
checking that generated files are newer than configure... done
configure: error: conditional "HAVE_NT_BUFFER_TRANSFER" was never defined.
Usually this means the macro was only invoked conditionally.
Maybe remove HAVE_NT_BUFFER_TRANSFER altogether?
> @yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
> This is the error:
> checking that generated files are newer than configure... done configure: error: conditional "HAVE_NT_BUFFER_TRANSFER" was never defined. Usually this means the macro was only invoked conditionally.
> Maybe remove HAVE_NT_BUFFER_TRANSFER altogether?
It was missing in configure.ac. I fixed it and pushed, let us see how the build behaves now. Thanks.
@arun-chandran-edarath pls see the merge conflicts
@yosefe I have implemented a dynamic-range-based usage depending on the total length of the transfer. Please take a look.
How to handle the build requirements?
Building it requires the use of --enable-optimizations (which in turn selects the necessary -march and -mavx options). How should we specify that 'nt-buffer-transfer' is enabled and then select the extra compiler flags?
Hi All,
I want to collaborate with the folks from Intel to get their opinion on setting the default values for all the tunable parameters of the nt-buffer-transfer patch. @yosefe Could you please help me with this request?
--Arun
@yosefe This PR has been sitting idle for a while, can you consider merging it before we branch out for 1.17?
Please let me know if there is anything I can do to help move this forward, or if there are any concerns or issues that need to be addressed.
--Arun
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
in general, to expedite the review process, I'd suggest to break down this PR into 2 parts:
- adding hint and total_len to ucs_memcpy_relaxed and adjusting upper layers accordingly
- adding read/write prefetch and implementing NT memory copy in UCS
@yosefe, sorry for the delay in my response. I was on vacation.
As per your suggestion, I have broken down the PR into three parts:
- Adding total_len and hint to ucs_memcpy_relaxed.
- Adding the prefetch functions required for nt-buffer-transfer().
- Implementing nt-buffer-transfer() for x86_64.
To expedite the review process, I have implemented parts 1 and 2 and pushed the changes. I have addressed most of the comments related to parts 1 and 2 in this push. However, there are some comments that require a more in-depth knowledge of the UCX codebase, which I currently lack. I might need expert help to solve those.
I will work on those comments for part 3 and push the changes as soon as possible.
In the meantime, let's focus on closing parts 1 and 2. --Arun
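For readers following the discussion, here is a purely illustrative sketch of what a hint- and total_len-aware relaxed copy wrapper could look like; the enum values, parameter names, and function name below are assumptions for illustration, not the signature from the actual patch.
```c
/* Illustrative sketch of a direction/hint-aware copy wrapper. */
#include <stddef.h>
#include <string.h>

typedef enum {
    MEMCPY_HINT_NONE = 0,   /* plain cached copy */
    MEMCPY_HINT_NT_SRC,     /* avoid polluting caches with the source */
    MEMCPY_HINT_NT_DST,     /* write the destination with non-temporal stores */
    MEMCPY_HINT_NT_SRC_DST  /* both directions non-temporal */
} memcpy_hint_t;

static void memcpy_relaxed_sketch(void *dst, const void *src, size_t len,
                                  memcpy_hint_t hint, size_t total_len)
{
    /* total_len describes the whole transfer (which may span several calls),
     * so an implementation can decide once whether the non-temporal path is
     * worthwhile; this sketch simply falls back to memcpy. */
    (void)hint;
    (void)total_len;
    memcpy(dst, src, len);
}
```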
@arun-chandran-edarath pls avoid force push during review, fix the comments only by adding new commits or merge commits
> @arun-chandran-edarath pls avoid force push during review, fix the comments only by adding new commits or merge commits
Thanks for the heads up. I'll avoid force push during review from now on.
@tvegas1 @rakhmets can you pls review as well?
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
saw description now, any perftest reproducer (maybe with xpmem/shmem) on different and then same CCD?
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
> just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
Yes, I already shared the osu_latency numbers for a zen3 machine with 32 MB L3 cache for 2 ranks in different NUMA domains. I will share the numbers for 2 ranks in the same CCD (L3 domain).
> just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
> saw description now, any perftest reproducer (maybe with xpmem/shmem) on different and then same CCD?
Please find the Excel sheet with osu_latency numbers for 4 different --map-by options (core, l3cache, numa and socket) on a zen3 machine with 32 MB L3 cache. The CICO range is highlighted in yellow and the SCOPY (xpmem) range is highlighted in green.
Command used: mpirun -np 2 --map-by "mapping" --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -x UCX_TLS=^ib -x UCX_NT_BUFFER_TRANSFER_MIN=0 ./osu_latency -m 1:268435456 -i 50000 -x 5000
@arun-chandran-edarath code is ok but there are 2 issues with the test:
- By default it doesn't run in CI because we need to pass "--enable-optimizations" to build with AVX support. Can we add a CI step for this - build with optimizations and run the test_arch tests?
- The test takes a very long time to run [1] because it iterates over (window_size, source_align, dst_align) - need to make it no longer than 10-20 seconds.
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from test_arch, where TypeParam =
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (746050 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (743477 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (332944 ms)
[----------] 3 tests from test_arch (1822471 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1822472 ms total)
[ PASSED ] 3 tests.
> - By default it doesn't run in CI because we need to pass "--enable-optimizations" to build with AVX support. Can we add a CI step for this - build with optimizations and run the test_arch tests?
@yosefe @tvegas1 How shall we do it?
> - The test takes a very long time to run [1] because it iterates over (window_size, source_align, dst_align) - need to make it no longer than 10-20 seconds.
I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
With this I currently get the execution times below:
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (3036 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (9485 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (11294 ms)
[----------] 4 tests from test_arch (23815 ms total)
> @yosefe @tvegas1 How shall we do it?
See https://github.com/openucx/ucx/blob/d510358597f64bdd7c9f41a21e0dac34839d19ef/contrib/test_jenkins.sh#L1003 for an example of how to build with custom flags and run specific unit tests. I guess we'll need it only when running on an AMD CPU?
> I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
BTW, how long does the test run under ASAN?
> I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
> BTW, how long does the test run under ASAN?
Under ASAN it took longer to execute, so I reduced test_window_size to 3K. With that, the execution times are shown below.
a) Without ASAN:
[ OK ] test_arch.memcpy (0 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (1838 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (5097 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (6077 ms)
[----------] 4 tests from test_arch (13012 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (13013 ms total)
[ PASSED ] 4 tests.
b) With ASAN:
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from test_arch, where TypeParam =
[ RUN ] test_arch.memcpy <> <>
[ SKIP ] (RUNNING_ON_VALGRIND || !ucs::perf_retry_count)
[ OK ] test_arch.memcpy (1 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (4058 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (5818 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (5718 ms)
[----------] 4 tests from test_arch (15595 ms total)
Can we reduce the time 2x more, or skip the test altogether on ASAN?
> Can we reduce the time 2x more, or skip the test altogether on ASAN?
I reduced test_window_size to 2K. This is the minimum size required; it covers the 'switch_to_nt_store_size' case, which is currently at '1464'. We cannot go much lower.
With that + ASAN, I get the execution time as shown below (total time of tests is less than 10s).
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from test_arch, where TypeParam =
[ RUN ] test_arch.memcpy <> <>
[ SKIP ] (RUNNING_ON_VALGRIND || !ucs::perf_retry_count)
[ OK ] test_arch.memcpy (0 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (2131 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (2678 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (2646 ms)
[----------] 4 tests from test_arch (7455 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (7455 ms total)
[ PASSED ] 4 tests.