ARCH/X86: Introduce non temporal buffer transfer
Processors based on AMD's Zen 3/4 architecture typically organize CPU cores into clusters of 8 or more cores. Within each cluster, these cores share a unified L3 cache, optimizing data transfer speeds among them. However, when data needs to travel between cores located in different clusters that do not share a common L3 cache, the transfer speed can be significantly reduced, potentially becoming 2-3 orders of magnitude slower.
To leverage this microarchitecture effectively, MPI applications often employ an 'n-ranks x m-threads' configuration, where:
n <= number of unified L3 caches in the system
m >= number of cores within a cluster
When UCX is used as the underlying data transfer mechanism, two ranks/processes in an intra-node scenario use either CICO (copy-in copy-out) or SCOPY (single copy) to move data between them, depending on the rendezvous threshold. This patch improves data transfer speed and application performance on AMD's Zen 3/4 architecture for both CICO and SCOPY when the sender and receiver processes are pinned to cores that don't share a common L3 cache.
a) CICO
```
     Sender                       Receiver
+-------------+               +------------+
+.............+               +............+
+    sbuf     +               +    rbuf    +
+.............+               +............+
+             +               +            +
+             +               +            +
+             +               +            +
+.............+>>>>>>>>>>>>>>>+............+
+   rshmem    +  POSIX/SYSV   +            +
+   mapped    +    mapping    +   rshmem   +
+    area     +               +            +
+.............+<<<<<<<<<<<<<<<+............+
+             +               +            +
+-------------+               +------------+
```
In the current method, the sender process first moves the data from the source buffer (sbuf) to the remote shared memory (rshmem) using glibc's memcpy and informs the receiver process about the presence of new data. The receiver process then copies the data from rshmem to its local buffer (rbuf), again using glibc's memcpy, thus completing the transfer.
In contrast, the new approach optimizes this by using distinct memory copy techniques. It utilizes 'nontemporal store' or 'store with PREFETCHNTA' when copying data from sbuf to rshmem and employs 'loads with PREFETCHNTA' when copying data from rshmem to rbuf. This optimization effectively reduces data transfer latency by circumventing cache-to-cache data transfers, which tend to be slower, especially when cores are situated in different clusters. Additionally, this method helps in minimizing cache pollution on both the sender and receiver sides, further enhancing overall performance.
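As a rough illustration of the sender-side idea only (not the code in this patch), a copy loop based on SSE2 streaming stores could look like the sketch below; alignment handling and tail cases are omitted, and the function name is hypothetical.
```c
/* Illustrative sketch: copy sbuf into the shared-memory segment with
 * non-temporal (streaming) stores so the written data bypasses the
 * sender's caches. Assumes a 16-byte-aligned dst and a length that is
 * a multiple of 16 bytes. */
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

static void copy_to_rshmem_nt(void *dst, const void *src, size_t len)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t         i;

    for (i = 0; i < len / sizeof(__m128i); i++) {
        /* regular load from sbuf, streaming store to rshmem */
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));
    }

    /* make the streamed stores globally visible before signalling the
     * receiver that new data is available */
    _mm_sfence();
}
```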
b) SCOPY (only xpmem is considered, as the other methods use memcpy inside the kernel)
```
     Sender                            Receiver
+-------------+               +--------------------+
+             +               +....................+
+             +               +        rbuf        +
+             +               +....................+
+             +               +                    +
+             +               +                    +
+             +               +                    +
+.............+>>>>>>>>>>>>>>>+....................+
+    sbuf     +               +   sbuf_xpmem_map   +
+.............+>>>>>>>>>>>>>>>+                    +
+             +               + physical pages of  +
+             +               + sbuf mapped to     +
+             +               + receiver's virtual +
+             +               + memory by fault    +
+             +               + handler in xpmem.ko+
+             +               +....................+
+             +               +                    +
+             +               +                    +
+-------------+               +--------------------+
```
The new method replaces glibc's memcpy of the receiver's data from sbuf_xpmem_map (src) to rbuf (dst) with 'loads with PREFETCHNTA'. This minimizes cache pollution inside the receiver's core (the sbuf pages are mapped with rw flags in the receiver process).
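A similarly simplified, hypothetical sketch of the receiver-side idea (again, not the actual UCX code): prefetch the mapped source with the NTA hint ahead of a regular copy. The block size, prefetch distance, and function name are illustrative assumptions.
```c
/* Illustrative sketch: copy from the xpmem-mapped source with NTA
 * prefetch hints so the source lines are evicted quickly and pollute
 * the receiver's caches as little as possible. */
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h> /* _mm_prefetch, _MM_HINT_NTA */

#define NT_BLOCK 64 /* one cache line per prefetch (illustrative) */

static void copy_from_mapped_src_nt(void *dst, const void *src, size_t len)
{
    const char *s = (const char *)src;
    char       *d = (char *)dst;
    size_t      off;

    for (off = 0; off < len; off += NT_BLOCK) {
        size_t chunk = (len - off < NT_BLOCK) ? (len - off) : NT_BLOCK;

        /* hint the next line as non-temporal; prefetching past the end
         * on the last iteration is architecturally harmless */
        _mm_prefetch(s + off + NT_BLOCK, _MM_HINT_NTA);
        memcpy(d + off, s + off, chunk);
    }
}
```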
Acked-by: Edgar Gabriel [email protected]
How to build
use "--enable-builtin-memcpy=no --enable-optimizations --enable-nt-buffer-transfer" while configuring
Eg: $./contrib/configure-release --prefix=/install_path --with-xpmem=/xpmem_path --enable-builtin-memcpy=no --enable-optimizations --enable-nt-buffer-transfer
How to run
During mpirun, nt-buffer-transfer is dynamically selected according to the table below.
Use "-x UCX_NT_BUFFER_TRANSFER_MIN=0" during mpirun for a workload that doesn't have more than one rank in the L3 domain.
Below is a comparison of latency when two ranks are in two different NUMA domains (from a Milan/Zen3 node).
cmd: mpirun -np 2 --map-by numa --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -x UCX_TLS=^ib -x UCX_NT_BUFFER_TRANSFER_MIN=0 ./osu_latency -m 1:268435456 -i 50000 -x 5000
Note: The default threshold values for selecting nt-buffer-transfer for an individual transfer are set at 3/4 the size of the L3 cache, to handle MPI bindings where multiple ranks share a common L3 cache. If the MPI workload has only one rank per L3, nt-buffer-transfer can be used for all lengths, and the user can enforce this by passing "-x UCX_NT_BUFFER_TRANSFER_MIN=0" on the command line.
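For clarity, a minimal sketch of the selection policy described above; the names (nt_threshold, nt_copy, select_copy) are hypothetical and not from the patch, only the policy (default threshold of 3/4 of the L3 size, user override such as UCX_NT_BUFFER_TRANSFER_MIN=0) follows the description.
```c
/* Illustrative sketch of the threshold-based selection described above. */
#include <stddef.h>
#include <string.h>

static void nt_copy(void *dst, const void *src, size_t len)
{
    /* placeholder for the non-temporal copy path */
    memcpy(dst, src, len);
}

static size_t default_nt_threshold(size_t l3_size)
{
    /* default: 3/4 of the L3 size, so ranks sharing an L3 keep using the
     * regular cached memcpy for smaller messages */
    return (l3_size / 4) * 3;
}

static void select_copy(void *dst, const void *src, size_t total_len,
                        size_t nt_threshold)
{
    if (total_len >= nt_threshold) {
        nt_copy(dst, src, total_len); /* large copies avoid cache pollution */
    } else {
        memcpy(dst, src, total_len);  /* small copies stay in cache */
    }
}
```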
--Arun
@yosefe I have re-designed it by adding the copy direction in ucs_memcpy_relaxed(). Please take a look.
Dynamic controls with variables similar to builtin_memcpy_min & builtin_memcpy_max have not yet been implemented; I will add them soon. --Arun
pls fix the commit title, see CI failure (would need force push)
Done. I have force-pushed after changing the commit titles of all the commits in this series.
@yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
> @yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
This is the error:
checking that generated files are newer than configure... done
configure: error: conditional "HAVE_NT_BUFFER_TRANSFER" was never defined.
Usually this means the macro was only invoked conditionally.
Maybe remove HAVE_NT_BUFFER_TRANSFER altogether?
> @yosefe I am not sure why the builds are failing in the automated tests, on my system I don't get such failures if I build without enabling 'nt-buffer-transfer'. Could you please help to solve it?
> This is the error:
> checking that generated files are newer than configure... done configure: error: conditional "HAVE_NT_BUFFER_TRANSFER" was never defined. Usually this means the macro was only invoked conditionally.
> Maybe remove HAVE_NT_BUFFER_TRANSFER altogether?
It was missing in configure.ac. I fixed it and pushed, let us see how the build behaves now. Thanks.
@arun-chandran-edarath pls see the merge conflicts
@yosefe I have implemented a dynamic-range-based usage depending on the total length of the transfer. Please take a look.
How to handle the build requirements?
Building it requires the use of --enable-optimizations (which in turn selects the necessary -march and -mavx options). How should we specify that 'nt-buffer-transfer' is enabled and then select the extra compiler flags?
Hi All,
I want to collaborate with the folks from Intel to get their opinion on setting the default values for all the tunable parameters of the nt-buffer-transfer patch. @yosefe Could you please help me with this request?
--Arun
@yosefe This PR has been sitting idle for a while, can you consider merging it before we branch out for 1.17?
Please let me know if there is anything I can do to help move this forward, or if there are any concerns or issues that need to be addressed.
--Arun
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
in general, to expedite the review process, I'd suggest to break down this PR into 2 parts:
- adding hint and total_len to ucs_memcpy_relaxed and adjusting upper layers accordingly
- adding read/write prefetch and implementing NT memory copy in UCS
@yosefe, sorry for the delay in my response. I was on vacation.
As per your suggestion, I have broken down the PR into three parts:
- Adding total_len and hint to ucs_memcpy_relaxed.
- Adding the prefetch functions required for nt-buffer-transfer().
- Implementing nt-buffer-transfer() for x86_64.
To expedite the review process, I have implemented parts 1 and 2 and pushed the changes. I have addressed most of the comments related to parts 1 and 2 in this push. However, there are some comments that require a more in-depth knowledge of the UCX codebase, which I currently lack. I might need expert help to solve those.
I will work on those comments for part 3 and push the changes as soon as possible.
In the meantime, let's focus on closing parts 1 and 2. --Arun
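For readers following the discussion, here is a purely illustrative sketch of what a hint- and total_len-aware relaxed copy wrapper could look like; the enum values, parameter names, and function name below are assumptions for illustration, not the signature from the actual patch.
```c
/* Illustrative sketch of a direction/hint-aware copy wrapper. */
#include <stddef.h>
#include <string.h>

typedef enum {
    MEMCPY_HINT_NONE = 0,   /* plain cached copy */
    MEMCPY_HINT_NT_SRC,     /* avoid polluting caches with the source */
    MEMCPY_HINT_NT_DST,     /* write the destination with non-temporal stores */
    MEMCPY_HINT_NT_SRC_DST  /* both directions non-temporal */
} memcpy_hint_t;

static void memcpy_relaxed_sketch(void *dst, const void *src, size_t len,
                                  memcpy_hint_t hint, size_t total_len)
{
    /* total_len describes the whole transfer (which may span several calls),
     * so an implementation can decide once whether the non-temporal path is
     * worthwhile; this sketch simply falls back to memcpy. */
    (void)hint;
    (void)total_len;
    memcpy(dst, src, len);
}
```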
@arun-chandran-edarath pls avoid force push during review, fix the comments only by adding new commits or merge commits
> @arun-chandran-edarath pls avoid force push during review, fix the comments only by adding new commits or merge commits
Thanks for the heads up. I'll avoid force push during review from now on.
@tvegas1 @rakhmets can you pls review as well?
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
saw description now, any perftest reproducer (maybe with xpmem/shmem) on different and then same CCD?
/azp run perf
Azure Pipelines successfully started running 1 pipeline(s).
> just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
Yes, I already shared the osu_latency numbers for a zen3 machine with 32 MB L3 cache for 2 ranks in different NUMA domains. I will share the numbers for 2 ranks in the same CCD (L3 domain).
> just for my understanding, do we have a repro (ucx_perftest, some osu tests..) with perf gain on applicable cpu?
> saw description now, any perftest reproducer (maybe with xpmem/shmem) on different and then same CCD?
Please find the Excel sheet with osu_latency numbers for 4 different --map-by options (core, l3cache, numa and socket) on a zen3 machine with 32 MB L3 cache. The CICO range is highlighted in yellow and the SCOPY (xpmem) range is highlighted in green.
Command used: mpirun -np 2 --map-by "mapping" --bind-to core --mca pml ucx --mca opal_common_ucx_tls any --mca opal_common_ucx_devices any -x UCX_TLS=^ib -x UCX_NT_BUFFER_TRANSFER_MIN=0 ./osu_latency -m 1:268435456 -i 50000 -x 5000
@arun-chandran-edarath code is ok but there are 2 issues with the test:
- By default it doesn't run in CI because we need to pass "--enable-optimizations" to build with AVX support. Can we add a CI step for this - build with optimizations and run the test_arch tests?
- The test takes a very long time to run [1] because it iterates over (window_size, source_align, dst_align) - need to make it no longer than 10-20 seconds.
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from test_arch, where TypeParam =
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (746050 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (743477 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (332944 ms)
[----------] 3 tests from test_arch (1822471 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (1822472 ms total)
[ PASSED ] 3 tests.
> - By default it doesn't run in CI because we need to pass "--enable-optimizations" to build with AVX support. Can we add a CI step for this - build with optimizations and run the test_arch tests?
@yosefe @tvegas1 How shall we do it?
> - The test takes a very long time to run [1] because it iterates over (window_size, source_align, dst_align) - need to make it no longer than 10-20 seconds.
I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
With this I currently get the execution times below:
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (3036 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (9485 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (11294 ms)
[----------] 4 tests from test_arch (23815 ms total)
> @yosefe @tvegas1 How shall we do it?
See https://github.com/openucx/ucx/blob/d510358597f64bdd7c9f41a21e0dac34839d19ef/contrib/test_jenkins.sh#L1003 for an example of how to build with custom flags and run specific unit tests. I guess we'll need it only when running on an AMD CPU?
> I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
BTW, how long does the test run under ASAN?
> I reduced the test_window_size to 4k and alignment to 64. This is enough to test all the scenarios and branches in the code.
> BTW, how long does the test run under ASAN?
Under ASAN it took longer to execute, so I reduced test_window_size to 3K. With that, the execution times are shown below.
a) Without ASAN:
[ OK ] test_arch.memcpy (0 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (1838 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (5097 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (6077 ms)
[----------] 4 tests from test_arch (13012 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (13013 ms total)
[ PASSED ] 4 tests.
b) With ASAN:
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from test_arch, where TypeParam =
[ RUN ] test_arch.memcpy <> <>
[ SKIP ] (RUNNING_ON_VALGRIND || !ucs::perf_retry_count)
[ OK ] test_arch.memcpy (1 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (4058 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (5818 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (5718 ms)
[----------] 4 tests from test_arch (15595 ms total)
Can we reduce the time 2x more, or skip the test altogether on ASAN?
> Can we reduce the time 2x more, or skip the test altogether on ASAN?
I reduced test_window_size to 2K. This is the minimum size required; it covers the 'switch_to_nt_store_size' case, which is currently at '1464'. We cannot go much lower.
With that + ASAN, I get the execution time as shown below (total time of tests is less than 10s).
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from test_arch, where TypeParam =
[ RUN ] test_arch.memcpy <> <>
[ SKIP ] (RUNNING_ON_VALGRIND || !ucs::perf_retry_count)
[ OK ] test_arch.memcpy (0 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src (2131 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_dst (2678 ms)
[ RUN ] test_arch.nt_buffer_transfer_nt_src_dst <> <>
[ OK ] test_arch.nt_buffer_transfer_nt_src_dst (2646 ms)
[----------] 4 tests from test_arch (7455 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (7455 ms total)
[ PASSED ] 4 tests.