gdrcopy
support AVX-512 instructions
Using an AVX-512-based memcpy is, in general, a bad idea.
Here is how gdr_copy_from_mapping performs with an AVX-512 implementation. (In fact, the SSE4.1 version is faster than the AVX version, and the source code prefers it over the AVX version.)
gdr_copy_from_mapping num iters for each size: 100
DBG: using AVX512 implementation of gdr_copy_from_bar
| Test | Size(B) | Avg.Time(us) |
|---|---|---|
| gdr_copy_from_mapping | 1 | 0.9811 |
| gdr_copy_from_mapping | 2 | 1.2646 |
| gdr_copy_from_mapping | 4 | 1.2648 |
| gdr_copy_from_mapping | 8 | 1.2640 |
| gdr_copy_from_mapping | 16 | 1.8958 |
| gdr_copy_from_mapping | 32 | 3.1540 |
| gdr_copy_from_mapping | 64 | 0.6476 |
| gdr_copy_from_mapping | 128 | 1.2858 |
| gdr_copy_from_mapping | 256 | 2.5581 |
| gdr_copy_from_mapping | 512 | 5.0851 |
| gdr_copy_from_mapping | 1024 | 10.2162 |
| gdr_copy_from_mapping | 2048 | 24.0402 |
| gdr_copy_from_mapping | 4096 | 44.5810 |
| gdr_copy_from_mapping | 8192 | 81.9428 |
| gdr_copy_from_mapping | 16384 | 170.7200 |
| gdr_copy_from_mapping | 32768 | 341.2040 |
| gdr_copy_from_mapping | 65536 | 675.1082 |
| gdr_copy_from_mapping | 131072 | 1357.5815 |
| gdr_copy_from_mapping | 262144 | 2706.2129 |
| gdr_copy_from_mapping | 524288 | 5425.6831 |
| gdr_copy_from_mapping | 1048576 | 10837.6549 |
| gdr_copy_from_mapping | 2097152 | 21672.5916 |
| gdr_copy_from_mapping | 4194304 | 55437.2406 |
| gdr_copy_from_mapping | 8388608 | 110991.1427 |
| gdr_copy_from_mapping | 16777216 | 222043.6687 |
Thank you for taking a look. Which CPU, GPU, and PCIe topology did you test on? Can you also report copy_to_mapping performance?
Thanks for your response!
CPU: Intel Xeon Silver 4114 (Skylake); GPU: Tesla P100-PCIE-12GB; CUDA version: 11.4
Here are the gdr_copy_to_mapping numbers for AVX-512:
gdr_copy_to_mapping num iters for each size: 10000
| Test | Size(B) | Avg.Time(us) |
|---|---|---|
| gdr_copy_to_mapping | 1 | 0.1250 |
| gdr_copy_to_mapping | 2 | 0.1245 |
| gdr_copy_to_mapping | 4 | 0.1245 |
| gdr_copy_to_mapping | 8 | 0.1222 |
| gdr_copy_to_mapping | 16 | 0.1263 |
| gdr_copy_to_mapping | 32 | 0.1252 |
| gdr_copy_to_mapping | 64 | 0.1280 |
| gdr_copy_to_mapping | 128 | 0.1376 |
| gdr_copy_to_mapping | 256 | 0.1439 |
| gdr_copy_to_mapping | 512 | 0.1550 |
| gdr_copy_to_mapping | 1024 | 0.1927 |
| gdr_copy_to_mapping | 2048 | 0.2631 |
| gdr_copy_to_mapping | 4096 | 0.4262 |
| gdr_copy_to_mapping | 8192 | 0.8239 |
| gdr_copy_to_mapping | 16384 | 1.6179 |
| gdr_copy_to_mapping | 32768 | 3.2132 |
| gdr_copy_to_mapping | 65536 | 6.4094 |
| gdr_copy_to_mapping | 131072 | 12.7935 |
| gdr_copy_to_mapping | 262144 | 25.5790 |
| gdr_copy_to_mapping | 524288 | 51.1738 |
| gdr_copy_to_mapping | 1048576 | 102.2248 |
| gdr_copy_to_mapping | 2097152 | 204.4293 |
| gdr_copy_to_mapping | 4194304 | 409.7942 |
| gdr_copy_to_mapping | 8388608 | 822.7885 |
| gdr_copy_to_mapping | 16777216 | 1683.7191 |
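To compare the read and write numbers on a common scale, the average times can be converted to effective bandwidth. The following is only a minimal sketch of that arithmetic; the helper names `bandwidth_mb_per_s` and `write_to_read_ratio` are hypothetical and are not part of gdrcopy.

```c
#include <stddef.h>

/* Hypothetical helper (not part of gdrcopy).
 * Effective bandwidth in MB/s, with 1 MB = 10^6 bytes.
 * Bytes divided by microseconds is numerically equal to MB/s. */
static double bandwidth_mb_per_s(size_t size_bytes, double avg_time_us)
{
    if (avg_time_us <= 0.0)
        return 0.0; /* avoid dividing by zero on a bogus measurement */
    return (double)size_bytes / avg_time_us;
}

/* Hypothetical helper: how many times longer the read (copy_from_mapping)
 * takes than the write (copy_to_mapping) for the same transfer size. */
static double write_to_read_ratio(double read_time_us, double write_time_us)
{
    return read_time_us / write_time_us;
}
```

Applied to the 16777216-byte rows reported in this thread, `bandwidth_mb_per_s(16777216, 222043.6687)` gives roughly 75.6 MB/s for gdr_copy_from_mapping, while `bandwidth_mb_per_s(16777216, 1683.7191)` gives roughly 9964 MB/s for gdr_copy_to_mapping, so the read path is about 132 times slower than the write path at that size.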
As for the PCIe topology, I'm not sure, but here is the output of lspci -tv:
-+-[0000:d7]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-0e.0 Intel Corporation Device 2058
| +-0e.1 Intel Corporation Device 2059
| +-0f.0 Intel Corporation Device 2058
| +-0f.1 Intel Corporation Device 2059
| +-12.0 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.1 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.2 Intel Corporation Sky Lake-E M3KTI Registers
| +-15.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.0 Intel Corporation Sky Lake-E M2PCI Registers
| \-16.4 Intel Corporation Sky Lake-E M2PCI Registers
+-[0000:ae]-+-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-08.0 Intel Corporation Device 2066
| +-09.0 Intel Corporation Device 2066
| +-0a.0 Intel Corporation Device 2040
| +-0a.1 Intel Corporation Device 2041
| +-0a.2 Intel Corporation Device 2042
| +-0a.3 Intel Corporation Device 2043
| +-0a.4 Intel Corporation Device 2044
| +-0a.5 Intel Corporation Device 2045
| +-0a.6 Intel Corporation Device 2046
| +-0a.7 Intel Corporation Device 2047
| +-0b.0 Intel Corporation Device 2048
| +-0b.1 Intel Corporation Device 2049
| +-0b.2 Intel Corporation Device 204a
| +-0b.3 Intel Corporation Device 204b
| +-0c.0 Intel Corporation Device 2040
| +-0c.1 Intel Corporation Device 2041
| +-0c.2 Intel Corporation Device 2042
| +-0c.3 Intel Corporation Device 2043
| +-0c.4 Intel Corporation Device 2044
| +-0c.5 Intel Corporation Device 2045
| +-0c.6 Intel Corporation Device 2046
| +-0c.7 Intel Corporation Device 2047
| +-0d.0 Intel Corporation Device 2048
| +-0d.1 Intel Corporation Device 2049
| +-0d.2 Intel Corporation Device 204a
| \-0d.3 Intel Corporation Device 204b
+-[0000:85]-+-00.0-[86]----00.0 NVIDIA Corporation GP100GL [Tesla P100 PCIe 12GB]
One caveat: I could probably have compiled with the -mavx512vl flag so that the AVX and AVX2 code paths could use up to 32 ymm registers, but I didn't. I wonder whether the loop unrolling in the source code should be tweaked if 32 registers are to be leveraged instead of the default 16.
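Whether a given build actually has the extra registers available can be checked at compile time. The sketch below assumes GCC or Clang, which define `__AVX__` and `__AVX512VL__` when the corresponding `-m` flags are active; the function name `ymm_registers_available` is hypothetical and is not part of gdrcopy.

```c
/* Hypothetical helper (not part of gdrcopy).
 * Reports how many architectural ymm registers the compiler may use for the
 * flags this translation unit was built with. Assumes GCC/Clang predefined
 * feature macros. */
static int ymm_registers_available(void)
{
#if !defined(__AVX__)
    return 0;   /* no ymm registers without AVX */
#elif !defined(__x86_64__)
    return 8;   /* 32-bit mode exposes only ymm0-ymm7 */
#elif defined(__AVX512VL__)
    return 32;  /* EVEX encoding makes ymm16-ymm31 reachable */
#else
    return 16;  /* plain AVX/AVX2 in 64-bit mode: ymm0-ymm15 */
#endif
}
```

An unroll factor could be chosen from this value at build time, but whether deeper unrolling helps reads from a BAR mapping would still have to be measured.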