Does the performance match expectations [only about one-fourth of the peak]?

Open njw1123 opened this issue 3 months ago • 0 comments

My setup is a single server with 8 H20 GPUs connected via NVLink (NV18 topology). Each link provides about 26 GB/s, so the theoretical aggregate bandwidth is around 400 GB/s. However, my tests only reach ~80 GB/s. Is this expected?

perf test

UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_LOG_LEVEL=info ucx_perftest -t ucp_am_bw -s 1000000 -w 1000 -i 1000000  -m cuda 0.0.0.0 -p 11005
UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_LOG_LEVEL=info ucx_perftest -c 0 -p 11005

output

Waiting for connection...
Accepted connection from 127.0.0.1:53210
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         am bandwidth / message rate                                                                |
| Data layout:  (automatic)                                                                                |
| Send memory:  cuda                                                                                       |
| Recv memory:  cuda                                                                                       |
| Message size: 1000000                                                                                    |
| Window size:  32                                                                                         |
| AM header size: 0                                                                                        |
+----------------------------------------------------------------------------------------------------------+
[1759758378.908137] [TENCENT64:1832716:0]     ucp_context.c:2339 UCX  INFO  Version 1.19.0 (loaded from /apdcephfs_zwfy2/share_303541817/pkuhetu/jiawen/nixl/ucx-1.19.0/install/lib/libucp.so.0)
[1759758381.145819] [TENCENT64:1832716:0]          parser.c:2359 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1759758381.145819] [TENCENT64:1832716:0]          parser.c:2359 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1759758381.145832] [TENCENT64:1832716:0]          parser.c:2368 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_NET_DEVICES=bond1
[1759758381.153526] [TENCENT64:1832716:0]      ucp_worker.c:1903 UCX  INFO    perftest intra-node cfg#1 rma_am(tcp/bond1)  am(tcp/bond1 cuda_ipc/cuda)
[1759758381.193148] [TENCENT64:1832716:0]      ucp_worker.c:1903 UCX  INFO    perftest self cfg#2 rma_am(tcp/bond1)  am(tcp/bond1 cuda_copy/cuda)
[1759758375.800815] [TENCENT64:1832718:0]        perftest.c:800  UCX  WARN  CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1759758378.908117] [TENCENT64:1832718:0]     ucp_context.c:2339 UCX  INFO  Version 1.19.0 (loaded from /apdcephfs_zwfy2/share_303541817/pkuhetu/jiawen/nixl/ucx-1.19.0/install/lib/libucp.so.0)
[1759758381.153053] [TENCENT64:1832718:0]          parser.c:2359 UCX  WARN  unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1759758381.153053] [TENCENT64:1832718:0]          parser.c:2359 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1759758381.153069] [TENCENT64:1832718:0]          parser.c:2368 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_NET_DEVICES=bond1
[1759758381.153534] [TENCENT64:1832718:0]      ucp_worker.c:1903 UCX  INFO    perftest intra-node cfg#1 rma_am(tcp/bond1)  am(tcp/bond1 cuda_ipc/cuda)
[1759758381.193140] [TENCENT64:1832718:0]      ucp_worker.c:1903 UCX  INFO    perftest self cfg#2 rma_am(tcp/bond1)  am(tcp/bond1 cuda_copy/cuda)
[thread 0]             70296     11.943    14.210    14.210    67115.13   67115.13       70375       70375
[thread 0]            134879     14.646    15.467    14.811    61659.83   64387.46       64655       67515
[thread 0]            197533     11.853    15.943    15.170    59818.93   62864.62       62725       65918
[thread 0]            268762     11.993    14.024    14.866    68005.44   64149.83       71309       67266
[thread 0]            339665     12.053    14.088    14.704    67691.95   64858.27       70980       68009
[thread 0]            410920     11.923    14.018    14.585    68030.20   65386.93       71335       68563
[thread 0]            481657     12.013    14.121    14.517    67536.17   65693.96       70817       68885
[thread 0]            552682     12.053    14.064    14.459    67811.13   65958.60       71105       69163
[thread 0]            623827     11.953    14.040    14.411    67925.37   66177.13       71225       69392
[thread 0]            694979     12.013    14.039    14.373    67931.92   66352.61       71232       69576
[thread 0]            765932     12.013    14.078    14.346    67742.34   66478.95       71033       69708
[thread 0]            836956     12.043    14.064    14.322    67809.57   66589.83       71103       69825
[thread 0]            907832     11.993    14.093    14.304    67669.01   66672.85       70956       69912
[thread 0]            978864     12.153    14.062    14.286    67817.36   66754.60       71112       69997
Final:               1000000     12.083    14.122    14.283    67528.77   66770.78       70809       70014

nvlink link speeds

GPU 0: NVIDIA H20 (UUID: GPU-31641892-7df4-6434-99c4-d81fe5a57930)
         Link 0: 26.562 GB/s
         Link 1: 26.562 GB/s
         Link 2: 26.562 GB/s
         Link 3: 26.562 GB/s
         Link 4: 26.562 GB/s
         Link 5: 26.562 GB/s
         Link 6: 26.562 GB/s
         Link 7: 26.562 GB/s
         Link 8: 26.562 GB/s
         Link 9: 26.562 GB/s
         Link 10: 26.562 GB/s
         Link 11: 26.562 GB/s
         Link 12: 26.562 GB/s
         Link 13: 26.562 GB/s
         Link 14: 26.562 GB/s
         Link 15: 26.562 GB/s
         Link 16: 26.562 GB/s
         Link 17: 26.562 GB/s
GPU 1: NVIDIA H20 (UUID: GPU-65fe466d-80f5-84c9-714c-114170c0bb1e)
         Link 0: 26.562 GB/s
         Link 1: 26.562 GB/s
         Link 2: 26.562 GB/s
         Link 3: 26.562 GB/s
         Link 4: 26.562 GB/s
         Link 5: 26.562 GB/s
         Link 6: 26.562 GB/s
         Link 7: 26.562 GB/s
         Link 8: 26.562 GB/s
         Link 9: 26.562 GB/s
         Link 10: 26.562 GB/s
         Link 11: 26.562 GB/s
         Link 12: 26.562 GB/s
         Link 13: 26.562 GB/s
         Link 14: 26.562 GB/s
         Link 15: 26.562 GB/s
         Link 16: 26.562 GB/s
         Link 17: 26.562 GB/s
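
For reference, the gap can be worked out directly from the numbers pasted above. This is only a rough sketch: it assumes the 18 listed per-link speeds can simply be summed into a per-GPU peak, and that the link listing uses decimal GB/s. Neither is confirmed here. The variable and function names are made up for illustration.

```python
# Figures copied from the output in this issue.
LINKS_PER_GPU = 18               # Link 0 .. Link 17 are listed for each GPU
LINK_SPEED_GB_S = 26.562         # per-link speed as listed
MESSAGE_SIZE_BYTES = 1_000_000   # "Message size" in the perftest header
FINAL_MSG_RATE = 70014           # "Final" row, overall message rate (msg/s)
FINAL_BW_REPORTED = 66770.78     # "Final" row, overall bandwidth column


def aggregate_link_speed(links: int, per_link_gb_s: float) -> float:
    """Sum of the per-link speeds (assumes the links simply add up)."""
    return links * per_link_gb_s


def measured_gb_s(msg_rate: float, msg_size_bytes: int) -> float:
    """Throughput in decimal GB/s derived from message rate and size."""
    return msg_rate * msg_size_bytes / 1e9


peak = aggregate_link_speed(LINKS_PER_GPU, LINK_SPEED_GB_S)
measured = measured_gb_s(FINAL_MSG_RATE, MESSAGE_SIZE_BYTES)
fraction = measured / peak

# Cross-check: the reported bandwidth column lines up with
# rate * size expressed in units of 2**20 bytes per second.
implied_reported = FINAL_MSG_RATE * MESSAGE_SIZE_BYTES / 2**20

print(f"summed link speed : {peak:.1f} GB/s")
print(f"measured          : {measured:.1f} GB/s")
print(f"fraction of peak  : {fraction:.1%}")
print(f"implied bw column : {implied_reported:.2f} (reported {FINAL_BW_REPORTED})")
```

Under those assumptions the summed link speed is about 478 GB/s and the measured rate is about 70 GB/s, so this run reaches roughly 15% of that figure (or about 18% of the ~400 GB/s estimate).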

I really hope to get your help.

njw1123, Oct 06 '25 13:10