Performance regression UCX 1.10.0 -> 1.14.0 on AMD system
Hi,
We sometimes rely on the UD transport instead of DC to lower memory consumption. We are seeing a performance regression between 1.10.0 and 1.14.0 with UD for our application.
For instance, on 4 nodes with 2x AMD EPYC 7532 and ConnectX-6 and MOFED 5.5 we are seeing the following:
Metric | UCX_TLS=self,sm,ud | UCX_TLS=self,sm,dc |
---|---|---|
UCX 1.10.0 elapsed time | 0.82 s | 0.80 s |
UCX 1.14.0 elapsed time | 0.85 s | 0.80 s |
UCX 1.10.0 memory | 87.5 GB | 97.1 GB |
UCX 1.14.0 memory | 87.0 GB | 94.4 GB |
Similar behavior is seen for other node counts and test cases.
We understand that DC is your preferred transport; however, it would still be good not to have any regressions when using UD. Are there any settings we could apply to 1.14.0 to restore the old behavior?
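For context, the transport is selected purely through the environment, roughly like the sketch below (assuming an Open MPI-style launcher; rank count, host file, and binary name are placeholders):

```
# UD run (lower memory footprint)
mpirun -np 256 --hostfile hosts -x UCX_TLS=self,sm,ud ./app
# DC run
mpirun -np 256 --hostfile hosts -x UCX_TLS=self,sm,dc ./app
```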
Thanks, Moritz
The regression is still present with 1.15.0-rc1. I have verified using the binary distribution ucx-1.15.0-rc1-centos8-mofed5-cuda11-x86_64.tar.bz2 to rule out any build/configuration issues on my end.
@mkre thank you for submitting the issue. Could you identify (bisect) the exact commit that caused the performance degradation?
@yosefe, I can try. Please give me some time.
@yosefe, I did a git bisect and, interestingly, it points to commit 7bafc7c201ec14d40e9526fa1f77325fe8d473d2. Does that make any sense to you? I have to say that the bisection results weren't perfectly black and white, so the regression might have crept in over several commits.
If that commit doesn't make sense at all, let me know and I will repeat the exercise.
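For reference, the bisection loop looked roughly like this (the benchmark command is a stand-in for our internal test, and the configure options are abbreviated):

```
git bisect start
git bisect bad v1.14.0
git bisect good v1.10.0
# At each step: rebuild the checked-out revision and re-run the benchmark
./autogen.sh && ./configure --prefix=$PWD/install && make -j && make install
# ...run the internal benchmark against $PWD/install and compare elapsed times...
git bisect good    # or "git bisect bad", depending on the measured time
```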
@mkre This commit added CPU frequency measurement during initialization; it might have added 0.03 s of overhead. When using only DC, this initialization would not be triggered. Can you pls run a longer test to see if the relative effect becomes smaller?
The reported timings are averages over multiple iterations and do not include initialization or startup time. That is, we call MPI_Init(), let the code run for some warmup iterations, then for some measured iterations (reported), before calling MPI_Finalize(). Is there any chance that the overhead might also be visible after initialization? If not, I can go ahead and try to do a better bisection.
No, that initialization should happen only once, during MPI_Init().
Hi @yosefe, I did another bisection using a different test case and it again pointed me to 7bafc7c201ec14d40e9526fa1f77325fe8d473d2. Maybe the changed frequency measurement leads to some changed tuning parameters causing the regression?
Reverting the commit helped on my small-scale bisection tests, but not on larger scales. I will do another bisection on a larger node count. There must be another offending commit. Stay tuned.
Hi @yosefe,
I finally got back to this. I did another bisection using a larger case ("jetnoise") on four nodes of our AMD system. The regression is pretty severe for this case, and it is also visible with UCX_TLS=all. This time, the bisection pointed at b8b9218b80ea6e6ac658b5e9d0d22f56102bbe44 as causing the regression. I think that makes more sense than the previously identified commit, as we are running on an AMD system and the commit addresses these systems specifically. For reference, here are my bisection results:
Version | Time (s) | git bisect |
---|---|---|
v1.10.0 | 4.67 | good |
v1.14.0 | 6.06 | bad |
1aa7505db782bea95aa004853575919ff35a6bfd | 4.71 | good |
5826e6f9cb6ad4008cdf3330de60e00b46addd0d | 4.68 | good |
b92bd4c7bd11e9508bd1864ab036a33986646909 | 4.69 | good |
82049a7b28543c6b69a2e8c4bb959cd299fcff13 | 5.99 | bad |
009b2457e40cb3ce260cabb5e50cc96f40fbdb18 | 6.01 | bad |
e59d2c4c42d635608b470ee0f4c09d8ace22dc70 | 5.97 | bad |
186aa9ab76bf4b00021f68af0738ffc452134810 | 4.72 | good |
7249eb189b5cbfe3e7a80e6ef36569644911231a | 5.87 | bad |
d1cf2ae6286ff254898a86e588bfc11c0ce4c51b | 4.73 | good |
ce4a8cf13b59c28792788f8bcac86dbfcbe5c3cc | 4.70 | good |
51565f2bb1fe0e20b20296baa06379909daeb311 | 5.84 | bad |
b8b9218b80ea6e6ac658b5e9d0d22f56102bbe44 | 5.91 | bad |
59c66aa4c63baf9ffe6015a42820a902b55b2fde | 5.90 | bad |
The runtime with v1.14.0 and reverted 59c66aa4c63baf9ffe6015a42820a902b55b2fde is 4.72 s, so back at the v1.10.0 performance.
The runtime with v1.15.0-rc6 and reverted 59c66aa4c63baf9ffe6015a42820a902b55b2fde is 4.62 s, so no regressions between v1.14.0 and v1.15.0 for this case.
Pinging @huaxiangfan as the author of that commit.
@mkre does setting UCX_IB_PCI_RELAXED_ORDERING=n with v1.14.0 and/or v1.15.0 improve the performance to the expected level?
Yes, it does. I will run our entire test suite with that setting.
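For the suite runs I will simply export the variable and let the launcher forward it; a sketch, assuming an Open MPI-style mpirun (host file, rank count, and the driver script are placeholders):

```
# Disable PCIe relaxed ordering for all ranks of the suite run
export UCX_IB_PCI_RELAXED_ORDERING=n
mpirun -np 512 --hostfile hosts -x UCX_IB_PCI_RELAXED_ORDERING ./run_case
```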
@yosefe, seems like fb6401c8c760b2601368f9209dfc3527b87cc71c might be another offender (but not sure yet). Is there an environment variable to roll back to the old behavior (revert doesn't work) so that I can confirm?
@mkre it's unlikely this PR would affect the UD transport, since UD doesn't support PUT_ZCOPY
@yosefe, we decided to go ahead with patching out 7bafc7c201ec14d40e9526fa1f77325fe8d473d2 from 1.14.0 and setting UCX_IB_PCI_RELAXED_ORDERING=n to avoid any regressions with the new UCX version. The PCI relaxed ordering regression seems pretty clear to me. The regression from 7bafc7c201ec14d40e9526fa1f77325fe8d473d2 still seems a bit odd, but maybe some downstream tuning logic depends on the frequency measurement, and the new measurement methodology leads to different tuning.
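Concretely, the patched build is produced roughly as sketched below (the install prefix is a placeholder, and the revert is assumed to apply cleanly on the release tag):

```
# Build v1.14.0 with the frequency-measurement commit reverted
git clone -b v1.14.0 https://github.com/openucx/ucx.git && cd ucx
git revert --no-edit 7bafc7c201ec14d40e9526fa1f77325fe8d473d2
./autogen.sh && ./configure --prefix=/opt/ucx-1.14.0-patched && make -j && make install
# At runtime we additionally disable PCIe relaxed ordering
export UCX_IB_PCI_RELAXED_ORDERING=n
```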
Are you intending to fix the behavior on your end?
@yosefe With UCX 1.16.0, we still need to set UCX_IB_PCI_RELAXED_ORDERING=n on our AMD cluster to avoid performance regressions. Are you planning to address this issue?
@edgargabriel WDYT should be the default setting of UCX_IB_PCI_RELAXED_ORDERING on AMD CPU?
@arun-chandran-edarath you are on the CPU side of the operations; can you find out what the recommendation from our side would be?
@yosefe @edgargabriel I will take a look at this issue. Thanks.
@yosefe, @mkre, @huaxiangfan I have some questions related to the original PR #8062, which introduced relaxed ordering on a subset of AMD CPUs (Naples, Rome, and Milan), and to the observed regression (but please note that I am not an InfiniBand (IB) expert).
In the comment by @huaxiangfan on the PR, it is mentioned that with relaxed ordering set, a throughput of 200 Gbps (line rate) was achieved, compared to 150 Gbps without it.
a) Can anyone please clarify if this achievement was claimed on the DC or UD transport? b) Is this improvement based on the results of a benchmarking suite or a real application?
Additionally, @mkre mentioned that the same commit causing the performance drop is observed only with UD (DC is fine) on AMD EPYC 7532 nodes based on the Rome architecture. It was mentioned that setting UCX_IB_PCI_RELAXED_ORDERING=n in the command line resolves this performance drop.
c) Would rolling back the relaxed ordering to "no" in ucs_cpu_prefer_relaxed_order() have any impact on existing DC users (or any user)? d) Does setting relaxed ordering to "yes" cause this regression only on certain IB devices, or does it affect all UD transports?
AFAIK, relaxed ordering is needed to get full PCIe gen4 bandwidth on some recent AMD Zen CPUs. That can be seen with the RC and DC transports, even with basic benchmarks such as http://github.com/linux-rdma/perftest
> Additionally, @mkre mentioned that the same commit causing the performance drop is observed only with UD (DC is fine) on AMD EPYC 7532 nodes based on the Rome architecture.
@arun-chandran-edarath Two separate issues leading to performance degradation were addressed in this ticket:
- relaxed ordering
- the mysterious 7bafc7c201ec14d40e9526fa1f77325fe8d473d2
It appears that the "relaxed ordering" issue affects both the UD and the DC transport, while the second issue was specific to UD. Sorry for the confusion.
With UCX 1.16, we are not applying any non-default UCX transport settings (i.e., we always run with UCX_TLS=all, which should eventually select DC), yet the performance is still significantly better when relaxed ordering is disabled.
@mkre is this issue seen only on AMD EPYC 7532 nodes? [Trying to understand whether it affects only Rome CPUs]
@mkre, could you please try running the performance benchmark http://github.com/linux-rdma/perftest on these nodes? Did you observe a performance drop in the benchmark results with RO=1 for both UD and DC?
I have been discussing this internally with our hardware experts, and they suggest that the potential cause for the drop in performance with UD could be related to packets being delivered out-of-order when RO=1 (when relaxed ordering is set to true), which may require additional time for software reordering. Does this explanation make sense to you?
Regarding DC, they were uncertain about the reason for the performance drop.
@arun-chandran-edarath,
> is this issue seen only on AMD EPYC 7532 nodes? [Trying to understand whether it affects only Rome CPUs]
I don't know, as I don't have any other AMD InfiniBand systems available to test on.
> could you please try running the performance benchmark http://github.com/linux-rdma/perftest on these nodes?
Please advise which tests exactly I should run.
> Did you observe a performance drop in the benchmark results with RO=1 for both UD and DC?
I ran a bunch of tests. Reported is the elapsed time (lower is better) on two nodes using the IFRF test case (for internal reference). The same effect is seen for other test cases, and it is more pronounced at larger node counts, but this test should demonstrate the overall behavior:
Configuration | Run #1 | Run #2 |
---|---|---|
UCX_IB_PCI_RELAXED_ORDERING=y UCX_TLS=self,sm,dc | 2.82 s | 2.82 s |
UCX_IB_PCI_RELAXED_ORDERING=n UCX_TLS=self,sm,dc | 2.78 s | 2.78 s |
UCX_IB_PCI_RELAXED_ORDERING=y UCX_TLS=self,sm,ud | 2.85 s | 2.84 s |
UCX_IB_PCI_RELAXED_ORDERING=n UCX_TLS=self,sm,ud | 2.83 s | 2.85 s |
Relaxed ordering doesn't seem to have an effect for UD, but it has a repeatable effect for DC.
Of the tests below, I am not sure which one is relevant for "RO=1" on the DMA path from the NIC to memory. @yosefe @huaxiangfan, could you please suggest one? Meanwhile, I will check internally.
* Send - ib_send_bw and ib_send_lat
* RDMA Read - ib_read_bw and ib_read_lat
* RDMA Write - ib_write_bw and ib_write_lat
* RDMA Atomic - ib_atomic_bw and ib_atomic_lat
* Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat
Pls see the --disable_pcie_relaxed parameter of the ib_write_bw / ib_read_bw / ib_send_bw tests.
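For example, a comparison could look something like the sketch below (device name and server hostname are placeholders; to be safe, use the flag on both sides when comparing):

```
# Server side (run first; repeat with --disable_pcie_relaxed for the second comparison):
ib_write_bw -d mlx5_0 --report_gbits
# Client side, relaxed ordering left enabled:
ib_write_bw -d mlx5_0 --report_gbits server-node
# Client side, relaxed ordering disabled:
ib_write_bw -d mlx5_0 --report_gbits --disable_pcie_relaxed server-node
```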
So far I could not see any reliable improvement with --disable_pcie_relaxed in my tests (tried a few different parameters, bw / latency measurements, different transports). The parameter space is quite large, so if you could recommend a few specific things to try, I can report the results here.
> So far I could not see any reliable improvement with --disable_pcie_relaxed in my tests (tried a few different parameters, bw / latency measurements, different transports). The parameter space is quite large, so if you could recommend a few specific things to try I can report the results here.
@yosefe @mkre that means the raw bandwidth tests don't show any degradation in performance with RO=1. Now, to evaluate the effect of "RO" in the UCX layer, which tests do you suggest: ucx_perftest or the OSU benchmarks?
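For instance, would a tag-bandwidth comparison along these lines be meaningful (hostname and message size are placeholders; the same environment would be set on the server side as well)?

```
# Server side:
ucx_perftest -t tag_bw -s 1048576
# Client side, with and without relaxed ordering, forcing the DC transport:
UCX_TLS=self,sm,dc UCX_IB_PCI_RELAXED_ORDERING=y ucx_perftest server-node -t tag_bw -s 1048576
UCX_TLS=self,sm,dc UCX_IB_PCI_RELAXED_ORDERING=n ucx_perftest server-node -t tag_bw -s 1048576
```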
> @edgargabriel WDYT should be the default setting of UCX_IB_PCI_RELAXED_ORDERING on AMD CPU?
@yosefe @edgargabriel I checked this internally; the recommended default value for UCX_IB_PCI_RELAXED_ORDERING on AMD CPUs is "yes". Therefore, I suggest setting the default value accordingly.