Performance regression UCX 1.10.0 -> 1.14.0 on AMD system
Hi,
We sometimes rely on the UD transport instead of DC to lower memory consumption. We are seeing a performance regression between 1.10.0 and 1.14.0 with UD for our application.
For instance, on 4 nodes with 2x AMD EPYC 7532 and ConnectX-6 and MOFED 5.5 we are seeing the following:
Metric | UCX_TLS=self,sm,ud | UCX_TLS=self,sm,dc |
---|---|---|
UCX 1.10.0 elapsed time | 0.82 s | 0.80 s |
UCX 1.14.0 elapsed time | 0.85 s | 0.80 s |
UCX 1.10.0 memory | 87.5 GB | 97.1 GB |
UCX 1.14.0 memory | 87.0 GB | 94.4 GB |
Similar behavior is seen for other node counts and test cases.
We understand that DC is your preferred transport; however, it would still be good not to have any regressions when using UD. Are there any settings we could apply to 1.14.0 to restore the old behavior?
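For context, the transport is selected purely through the environment, roughly like the sketch below (assuming an Open MPI-style launcher; rank count, host file, and binary name are placeholders):

```
# UD run (lower memory footprint)
mpirun -np 256 --hostfile hosts -x UCX_TLS=self,sm,ud ./app
# DC run
mpirun -np 256 --hostfile hosts -x UCX_TLS=self,sm,dc ./app
```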
Thanks, Moritz
The regression is still present with 1.15.0-rc1. I have verified using the binary distribution ucx-1.15.0-rc1-centos8-mofed5-cuda11-x86_64.tar.bz2 to rule out any build/configuration issues on my end.
@mkre thank you for submitting the issue. Could you identify (bisect) the exact commit that caused the performance degradation?
@yosefe, I can try. Please give me some time.
@yosefe, I did a git bisect and, interestingly, it points to commit 7bafc7c201ec14d40e9526fa1f77325fe8d473d2. Does that make any sense to you? I have to say that the bisection results weren't perfectly black and white, so the regression might have crept in over several commits.
If that commit doesn't make sense at all, let me know and I will repeat the exercise.
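For reference, the bisection loop looked roughly like this (the benchmark command is a stand-in for our internal test, and the configure options are abbreviated):

```
git bisect start
git bisect bad v1.14.0
git bisect good v1.10.0
# At each step: rebuild the checked-out revision and re-run the benchmark
./autogen.sh && ./configure --prefix=$PWD/install && make -j && make install
# ...run the internal benchmark against $PWD/install and compare elapsed times...
git bisect good    # or "git bisect bad", depending on the measured time
```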
@mkre This commit added CPU frequency measurement during initialization; it might have added 0.03 s of overhead. When using only DC, this initialization would not be triggered. Can you pls run a longer test to see if the relative effect becomes smaller?
The reported timings are averages over multiple iterations and do not include initialization or startup time. That is, we call MPI_Init(), let the code run for some warmup iterations, then for some measured iterations (reported), before calling MPI_Finalize(). Is there any chance that the overhead might also be visible after initialization? If not, I can go ahead and try to do a better bisection.
No, that initialization should happen only once, during MPI_Init().
Hi @yosefe, I did another bisection using a different test case and it again pointed me to 7bafc7c201ec14d40e9526fa1f77325fe8d473d2. Maybe the changed frequency measurement leads to some changed tuning parameters causing the regression?
Reverting the commit helped on my small-scale bisection tests, but not on larger scales. I will do another bisection on a larger node count. There must be another offending commit. Stay tuned.
Hi @yosefe,
I finally got back to this. I did another bisection using a larger case ("jetnoise") on four nodes of our AMD system. The regression is pretty severe for this case, and it is also visible with UCX_TLS=all. This time, the bisection pointed at b8b9218b80ea6e6ac658b5e9d0d22f56102bbe44 as causing the regression. I think that makes more sense than the previously identified commit, as we are running on an AMD system and the commit addresses these systems specifically. For reference, here are my bisection results:
Version | Time (s) | git bisect |
---|---|---|
v1.10.0 | 4.67 | good |
v1.14.0 | 6.06 | bad |
1aa7505db782bea95aa004853575919ff35a6bfd | 4.71 | good |
5826e6f9cb6ad4008cdf3330de60e00b46addd0d | 4.68 | good |
b92bd4c7bd11e9508bd1864ab036a33986646909 | 4.69 | good |
82049a7b28543c6b69a2e8c4bb959cd299fcff13 | 5.99 | bad |
009b2457e40cb3ce260cabb5e50cc96f40fbdb18 | 6.01 | bad |
e59d2c4c42d635608b470ee0f4c09d8ace22dc70 | 5.97 | bad |
186aa9ab76bf4b00021f68af0738ffc452134810 | 4.72 | good |
7249eb189b5cbfe3e7a80e6ef36569644911231a | 5.87 | bad |
d1cf2ae6286ff254898a86e588bfc11c0ce4c51b | 4.73 | good |
ce4a8cf13b59c28792788f8bcac86dbfcbe5c3cc | 4.70 | good |
51565f2bb1fe0e20b20296baa06379909daeb311 | 5.84 | bad |
b8b9218b80ea6e6ac658b5e9d0d22f56102bbe44 | 5.91 | bad |
59c66aa4c63baf9ffe6015a42820a902b55b2fde | 5.90 | bad |
The runtime with v1.14.0 and reverted 59c66aa4c63baf9ffe6015a42820a902b55b2fde is 4.72 s, so back at the v1.10.0 performance.
The runtime with v1.15.0-rc6 and reverted 59c66aa4c63baf9ffe6015a42820a902b55b2fde is 4.62 s, so no regressions between v1.14.0 and v1.15.0 for this case.
Pinging @huaxiangfan as the author of that commit.
@mkre does setting UCX_IB_PCI_RELAXED_ORDERING=n with v1.14.0 and/or v1.15.0 improve the performance to the expected level?
Yes, it does. I will run our entire test suite with that setting.
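For the suite runs I will simply export the variable and let the launcher forward it; a sketch, assuming an Open MPI-style mpirun (host file, rank count, and the driver script are placeholders):

```
# Disable PCIe relaxed ordering for all ranks of the suite run
export UCX_IB_PCI_RELAXED_ORDERING=n
mpirun -np 512 --hostfile hosts -x UCX_IB_PCI_RELAXED_ORDERING ./run_case
```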
@yosefe, seems like fb6401c8c760b2601368f9209dfc3527b87cc71c might be another offender (but not sure yet). Is there an environment variable to roll back to the old behavior (revert doesn't work) so that I can confirm?
@mkre it's unlikely this PR would affect the UD transport, since UD doesn't support PUT_ZCOPY
@yosefe, we decided to go ahead with patching out 7bafc7c201ec14d40e9526fa1f77325fe8d473d2 from 1.14.0 and setting UCX_IB_PCI_RELAXED_ORDERING=n to avoid any regressions with the new UCX version. The PCI relaxed ordering regression seems pretty clear to me. The regression from 7bafc7c201ec14d40e9526fa1f77325fe8d473d2 still seems a bit odd, but maybe some downstream tuning logic depends on the frequency measurement, and the new measurement methodology leads to different tuning.
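Concretely, the patched build is produced roughly as sketched below (the install prefix is a placeholder, and the revert is assumed to apply cleanly on the release tag):

```
# Build v1.14.0 with the frequency-measurement commit reverted
git clone -b v1.14.0 https://github.com/openucx/ucx.git && cd ucx
git revert --no-edit 7bafc7c201ec14d40e9526fa1f77325fe8d473d2
./autogen.sh && ./configure --prefix=/opt/ucx-1.14.0-patched && make -j && make install
# At runtime we additionally disable PCIe relaxed ordering
export UCX_IB_PCI_RELAXED_ORDERING=n
```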
Are you intending to fix the behavior on your end?
@yosefe With UCX 1.16.0, we still need to set UCX_IB_PCI_RELAXED_ORDERING=n on our AMD cluster to avoid performance regressions. Are you planning to address this issue?
@edgargabriel WDYT should be the default setting of UCX_IB_PCI_RELAXED_ORDERING on AMD CPU?
@arun-chandran-edarath you are on the CPU side of the operations; can you find out what the recommendation from our side would be?
@yosefe @edgargabriel I will take a look at this issue. Thanks.
@yosefe, @mkre, @huaxiangfan I have some questions related to the original PR #8062, which introduced relaxed ordering on a subset of AMD CPUs (Naples, Rome, and Milan), and to the observed regression (but please note that I am not an InfiniBand (IB) expert).
In the comment by @huaxiangfan on the PR, it is mentioned that with relaxed ordering set, a throughput of 200 Gbps (line rate) was achieved, compared to 150 Gbps without it.
a) Can anyone please clarify if this achievement was claimed on the DC or UD transport? b) Is this improvement based on the results of a benchmarking suite or a real application?
Additionally, @mkre mentioned that the same commit causing the performance drop is observed only with UD (DC is fine) on AMD EPYC 7532 nodes based on the Rome architecture. It was mentioned that setting UCX_IB_PCI_RELAXED_ORDERING=n in the command line resolves this performance drop.
c) Would rolling back the relaxed ordering to "no" in ucs_cpu_prefer_relaxed_order() have any impact on existing DC users (or any user)? d) Does setting relaxed ordering to "yes" cause this regression only on certain IB devices, or does it affect all UD transports?
AFAIK, relaxed ordering is needed to get full PCIe gen4 bandwidth on some recent AMD Zen CPUs. That can be seen with the RC and DC transports, even with basic benchmarks such as http://github.com/linux-rdma/perftest
> Additionally, @mkre mentioned that the same commit causing the performance drop is observed only with UD (DC is fine) on AMD EPYC 7532 nodes based on the Rome architecture.
@arun-chandran-edarath Two separate issues leading to performance degradation were addressed in this ticket:
- relaxed ordering
- the mysterious 7bafc7c201ec14d40e9526fa1f77325fe8d473d2
It appears that the "relaxed ordering" issue affects both the UD and the DC transport, while the second issue was specific to UD. Sorry for the confusion.
With UCX 1.16, we are not applying any non-default UCX transport settings (i.e., we always run with UCX_TLS=all, which should eventually select DC), yet the performance is still significantly better when relaxed ordering is disabled.
@mkre is this issue seen only on AMD EPYC 7532 nodes? [Trying to understand whether it affects only Rome CPUs]
@mkre, could you please try running the performance benchmark http://github.com/linux-rdma/perftest on these nodes? Did you observe a performance drop in the benchmark results with RO=1 for both UD and DC?
I have been discussing this internally with our hardware experts, and they suggest that the potential cause for the drop in performance with UD could be related to packets being delivered out-of-order when RO=1 (when relaxed ordering is set to true), which may require additional time for software reordering. Does this explanation make sense to you?
Regarding DC, they were uncertain about the reason for the performance drop.
@arun-chandran-edarath,
> is this issue seen only on AMD EPYC 7532 nodes? [Trying to understand whether it affects only Rome CPUs]
I don't know, as I don't have any other AMD InfiniBand systems available to test on.
> could you please try running the performance benchmark http://github.com/linux-rdma/perftest on these nodes?
Please advise which tests exactly I should run.
> Did you observe a performance drop in the benchmark results with RO=1 for both UD and DC?
I ran a bunch of tests. Reported is the elapsed time (lower is better) on two nodes using the IFRF test case (for internal reference). The same effect is seen for other test cases, and it is more pronounced at larger node counts, but this test should demonstrate the overall behavior:
Configuration | Run #1 | Run #2 |
---|---|---|
UCX_IB_PCI_RELAXED_ORDERING=y UCX_TLS=self,sm,dc | 2.82 s | 2.82 s |
UCX_IB_PCI_RELAXED_ORDERING=n UCX_TLS=self,sm,dc | 2.78 s | 2.78 s |
UCX_IB_PCI_RELAXED_ORDERING=y UCX_TLS=self,sm,ud | 2.85 s | 2.84 s |
UCX_IB_PCI_RELAXED_ORDERING=n UCX_TLS=self,sm,ud | 2.83 s | 2.85 s |
Relaxed ordering doesn't seem to have an effect for UD, but it has a repeatable effect for DC.
Of the tests below, I am not sure which one is relevant for "RO=1" on the DMA path from the NIC to memory. @yosefe @huaxiangfan, could you please suggest one? Meanwhile, I will check internally.
* Send - ib_send_bw and ib_send_lat
* RDMA Read - ib_read_bw and ib_read_lat
* RDMA Write - ib_write_bw and ib_write_lat
* RDMA Atomic - ib_atomic_bw and ib_atomic_lat
* Native Ethernet (when working with MOFED2) - raw_ethernet_bw, raw_ethernet_lat
Pls see the --disable_pcie_relaxed parameter of the ib_write_bw / ib_read_bw / ib_send_bw tests.
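For example, a comparison could look something like the sketch below (device name and server hostname are placeholders; to be safe, use the flag on both sides when comparing):

```
# Server side (run first; repeat with --disable_pcie_relaxed for the second comparison):
ib_write_bw -d mlx5_0 --report_gbits
# Client side, relaxed ordering left enabled:
ib_write_bw -d mlx5_0 --report_gbits server-node
# Client side, relaxed ordering disabled:
ib_write_bw -d mlx5_0 --report_gbits --disable_pcie_relaxed server-node
```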
So far I could not see any reliable improvement with --disable_pcie_relaxed in my tests (tried a few different parameters, bw / latency measurements, different transports). The parameter space is quite large, so if you could recommend a few specific things to try, I can report the results here.
> So far I could not see any reliable improvement with --disable_pcie_relaxed in my tests (tried a few different parameters, bw / latency measurements, different transports). The parameter space is quite large, so if you could recommend a few specific things to try I can report the results here.
@yosefe @mkre that means the raw bandwidth tests don't show any degradation in performance with RO=1. Now, to evaluate the effect of "RO" in the UCX layer, which tests do you suggest: ucx_perftest or the OSU benchmarks?
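For instance, would a tag-bandwidth comparison along these lines be meaningful (hostname and message size are placeholders; the same environment would be set on the server side as well)?

```
# Server side:
ucx_perftest -t tag_bw -s 1048576
# Client side, with and without relaxed ordering, forcing the DC transport:
UCX_TLS=self,sm,dc UCX_IB_PCI_RELAXED_ORDERING=y ucx_perftest server-node -t tag_bw -s 1048576
UCX_TLS=self,sm,dc UCX_IB_PCI_RELAXED_ORDERING=n ucx_perftest server-node -t tag_bw -s 1048576
```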
> @edgargabriel WDYT should be the default setting of UCX_IB_PCI_RELAXED_ORDERING on AMD CPU?
@yosefe @edgargabriel I checked this internally; the recommended default value for UCX_IB_PCI_RELAXED_ORDERING on AMD CPUs is "yes". Therefore, I suggest setting the default value accordingly.