Does the performance match expectations [only about one-fourth of the peak]?
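My setup is a single server with 8 H20 GPUs connected via NVLink (NV18 topology). Each of the 18 links per GPU is reported at 26.562 GB/s, so I expect a theoretical aggregate bandwidth of around 400 GB/s (18 × 26.562 ≈ 478 GB/s if the links are simply summed). However, my tests only reach ~80 GB/s; the run shown below averages 70,809 messages/s of 1,000,000 bytes each, i.e. about 71 GB/s. Is this expected?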
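To make the expectation explicit, here is the arithmetic as a small Python sketch. The link count and per-link speed are copied from the link listing at the end of this post; treating the per-link figures as simply additive is my assumption, and the variable names are only for illustration.

```python
# Per-GPU NVLink figures copied from the link listing at the end of this post.
LINKS_PER_GPU = 18          # Link 0 .. Link 17
LINK_SPEED_GB_S = 26.562    # reported speed of each link

# Assumption: the per-link figures can simply be summed.
aggregate_gb_s = LINKS_PER_GPU * LINK_SPEED_GB_S

# Bandwidth I quoted for my tests.
observed_gb_s = 80.0
fraction = observed_gb_s / aggregate_gb_s

print(f"summed link bandwidth: {aggregate_gb_s:.1f} GB/s")
print(f"observed / summed:     {fraction:.2f}")
```

With these inputs the summed figure is somewhat higher than the 400 GB/s I estimated, so the gap is, if anything, larger than one-fourth.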
perf test
UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_LOG_LEVEL=info ucx_perftest -t ucp_am_bw -s 1000000 -w 1000 -i 1000000 -m cuda 0.0.0.0 -p 11005
UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_LOG_LEVEL=info ucx_perftest -c 0 -p 11005
output
Waiting for connection...
Accepted connection from 127.0.0.1:53210
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: am bandwidth / message rate |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 1000000 |
| Window size: 32 |
| AM header size: 0 |
+----------------------------------------------------------------------------------------------------------+
[1759758378.908137] [TENCENT64:1832716:0] ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /apdcephfs_zwfy2/share_303541817/pkuhetu/jiawen/nixl/ucx-1.19.0/install/lib/libucp.so.0)
[1759758381.145819] [TENCENT64:1832716:0] parser.c:2359 UCX WARN unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1759758381.145819] [TENCENT64:1832716:0] parser.c:2359 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1759758381.145832] [TENCENT64:1832716:0] parser.c:2368 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_NET_DEVICES=bond1
[1759758381.153526] [TENCENT64:1832716:0] ucp_worker.c:1903 UCX INFO perftest intra-node cfg#1 rma_am(tcp/bond1) am(tcp/bond1 cuda_ipc/cuda)
[1759758381.193148] [TENCENT64:1832716:0] ucp_worker.c:1903 UCX INFO perftest self cfg#2 rma_am(tcp/bond1) am(tcp/bond1 cuda_copy/cuda)
[1759758375.800815] [TENCENT64:1832718:0] perftest.c:800 UCX WARN CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1759758378.908117] [TENCENT64:1832718:0] ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /apdcephfs_zwfy2/share_303541817/pkuhetu/jiawen/nixl/ucx-1.19.0/install/lib/libucp.so.0)
[1759758381.153053] [TENCENT64:1832718:0] parser.c:2359 UCX WARN unused environment variable: UCX_ROOT (maybe: UCX_PROTOS?)
[1759758381.153053] [TENCENT64:1832718:0] parser.c:2359 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1759758381.153069] [TENCENT64:1832718:0] parser.c:2368 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_NET_DEVICES=bond1
[1759758381.153534] [TENCENT64:1832718:0] ucp_worker.c:1903 UCX INFO perftest intra-node cfg#1 rma_am(tcp/bond1) am(tcp/bond1 cuda_ipc/cuda)
[1759758381.193140] [TENCENT64:1832718:0] ucp_worker.c:1903 UCX INFO perftest self cfg#2 rma_am(tcp/bond1) am(tcp/bond1 cuda_copy/cuda)
[thread 0] 70296 11.943 14.210 14.210 67115.13 67115.13 70375 70375
[thread 0] 134879 14.646 15.467 14.811 61659.83 64387.46 64655 67515
[thread 0] 197533 11.853 15.943 15.170 59818.93 62864.62 62725 65918
[thread 0] 268762 11.993 14.024 14.866 68005.44 64149.83 71309 67266
[thread 0] 339665 12.053 14.088 14.704 67691.95 64858.27 70980 68009
[thread 0] 410920 11.923 14.018 14.585 68030.20 65386.93 71335 68563
[thread 0] 481657 12.013 14.121 14.517 67536.17 65693.96 70817 68885
[thread 0] 552682 12.053 14.064 14.459 67811.13 65958.60 71105 69163
[thread 0] 623827 11.953 14.040 14.411 67925.37 66177.13 71225 69392
[thread 0] 694979 12.013 14.039 14.373 67931.92 66352.61 71232 69576
[thread 0] 765932 12.013 14.078 14.346 67742.34 66478.95 71033 69708
[thread 0] 836956 12.043 14.064 14.322 67809.57 66589.83 71103 69825
[thread 0] 907832 11.993 14.093 14.304 67669.01 66672.85 70956 69912
[thread 0] 978864 12.153 14.062 14.286 67817.36 66754.60 71112 69997
Final: 1000000 12.083 14.122 14.283 67528.77 66770.78 70809 70014
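As a sanity check on the units, the sketch below recomputes the bandwidth from the message rate in the Final line above. The column positions follow the table header printed by ucx_perftest; the conclusion that the MB/s column counts units of 2^20 bytes is my own reading of these numbers, not something I verified in the UCX source.

```python
# "Final" line copied verbatim from the perftest output above.
final_line = "Final: 1000000 12.083 14.122 14.283 67528.77 66770.78 70809 70014"
fields = final_line.split()

reported_avg_bw = float(fields[5])   # "bandwidth (MB/s)", average column
avg_msg_rate = int(fields[7])        # "message rate (msg/s)", average column
message_size = 1_000_000             # bytes, from "-s 1000000"

bytes_per_sec = avg_msg_rate * message_size
gb_per_sec = bytes_per_sec / 1e9       # decimal gigabytes per second
mib_per_sec = bytes_per_sec / 2**20    # units of 2^20 bytes per second

print(f"from message rate: {gb_per_sec:.1f} GB/s")
print(f"in 2^20-byte units: {mib_per_sec:.0f} (perftest reported {reported_avg_bw})")
```

The recomputed value in 2^20-byte units lands within rounding of the reported 67528.77, so the average throughput of this run is about 70.8 GB/s in decimal units.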
GPU 0: NVIDIA H20 (UUID: GPU-31641892-7df4-6434-99c4-d81fe5a57930)
Link 0: 26.562 GB/s
Link 1: 26.562 GB/s
Link 2: 26.562 GB/s
Link 3: 26.562 GB/s
Link 4: 26.562 GB/s
Link 5: 26.562 GB/s
Link 6: 26.562 GB/s
Link 7: 26.562 GB/s
Link 8: 26.562 GB/s
Link 9: 26.562 GB/s
Link 10: 26.562 GB/s
Link 11: 26.562 GB/s
Link 12: 26.562 GB/s
Link 13: 26.562 GB/s
Link 14: 26.562 GB/s
Link 15: 26.562 GB/s
Link 16: 26.562 GB/s
Link 17: 26.562 GB/s
GPU 1: NVIDIA H20 (UUID: GPU-65fe466d-80f5-84c9-714c-114170c0bb1e)
Link 0: 26.562 GB/s
Link 1: 26.562 GB/s
Link 2: 26.562 GB/s
Link 3: 26.562 GB/s
Link 4: 26.562 GB/s
Link 5: 26.562 GB/s
Link 6: 26.562 GB/s
Link 7: 26.562 GB/s
Link 8: 26.562 GB/s
Link 9: 26.562 GB/s
Link 10: 26.562 GB/s
Link 11: 26.562 GB/s
Link 12: 26.562 GB/s
Link 13: 26.562 GB/s
Link 14: 26.562 GB/s
Link 15: 26.562 GB/s
Link 16: 26.562 GB/s
Link 17: 26.562 GB/s
I would really appreciate your help.