vecmem
AoS / SoA Copy Benchmarks, main branch (2024.09.27.)
While working on https://github.com/acts-project/traccc/pull/712 yesterday, I was surprised to see how expensive it apparently is to copy a few megabytes of cell information from one host location to another. What I saw in Nsight Systems was that the vecmem::edm::host -> vecmem::edm::buffer host-to-host copy took about as long as the subsequent vecmem::edm::buffer -> vecmem::edm::buffer host-to-device copy.
As a reminder, the point of the host-to-host step is to allow copying the entire payload of an SoA container in one step, instead of column by column.
But as it turns out, the overhead of copying a cell collection in 5 steps instead of one (a traccc cell has only 5 variables) is negligible compared to how long it takes to copy a few megabytes from one place to another in host memory. :confused:
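To make the "5 steps instead of one" comparison concrete, here is a minimal sketch of the two copy patterns using plain std::memcpy. It does not use the vecmem API: the cells_soa struct, its column names and the pack_columns / copy_packed helpers are all hypothetical and only stand in for a 5-variable SoA container.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical SoA container with 5 columns (names are illustrative only,
// they are not the actual traccc cell variables).
struct cells_soa {
    std::vector<unsigned int> channel0, channel1, module;
    std::vector<float> activation, time;
};

// Total payload size of all 5 columns, in bytes.
std::size_t payload_bytes(const cells_soa& c) {
    return (c.channel0.size() + c.channel1.size() + c.module.size()) *
               sizeof(unsigned int) +
           (c.activation.size() + c.time.size()) * sizeof(float);
}

// Column-by-column copy: 5 separate memcpy calls into one contiguous block.
std::vector<unsigned char> pack_columns(const cells_soa& c) {
    // All columns of an SoA container must have the same length.
    assert(c.channel1.size() == c.channel0.size() &&
           c.module.size() == c.channel0.size() &&
           c.activation.size() == c.channel0.size() &&
           c.time.size() == c.channel0.size());
    std::vector<unsigned char> out(payload_bytes(c));
    unsigned char* dst = out.data();
    auto append = [&dst](const void* src, std::size_t n) {
        if (n != 0) {
            std::memcpy(dst, src, n);
        }
        dst += n;
    };
    append(c.channel0.data(), c.channel0.size() * sizeof(unsigned int));
    append(c.channel1.data(), c.channel1.size() * sizeof(unsigned int));
    append(c.module.data(), c.module.size() * sizeof(unsigned int));
    append(c.activation.data(), c.activation.size() * sizeof(float));
    append(c.time.data(), c.time.size() * sizeof(float));
    return out;
}

// Single-step copy of an already contiguous payload: 1 memcpy call.
std::vector<unsigned char> copy_packed(const std::vector<unsigned char>& in) {
    std::vector<unsigned char> out(in.size());
    if (!in.empty()) {
        std::memcpy(out.data(), in.data(), in.size());
    }
    return out;
}
```

Both functions move the same number of bytes. The only difference is the number of calls, which is the fixed overhead that turns out not to matter next to the time spent actually moving a few megabytes.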
So in this PR I want to see exactly how copying the same sort of EDM in AoS and in SoA form compares. Right now, with only the host copies existing, I get:
[bash][pcadp04]:vecmem > ./build/bin/vecmem_benchmark_core --benchmark_filter="AoS|SoA"
2024-09-27T11:02:20+02:00
Running ./build/bin/vecmem_benchmark_core
Run on (48 X 1796.81 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 512 KiB (x24)
L3 Unified 32768 KiB (x4)
Load Average: 0.06, 0.07, 0.21
---------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
simpleSoADirectHostToFixedBufferCopy/1 47.0 ns 46.9 ns 14915255 Bytes=16 Rate=325.124M/s
simpleSoADirectHostToFixedBufferCopy/8 44.8 ns 44.7 ns 15652590 Bytes=128 Rate=2.66556G/s
simpleSoADirectHostToFixedBufferCopy/64 48.8 ns 48.7 ns 14353059 Bytes=1024 Rate=19.5998G/s
simpleSoADirectHostToFixedBufferCopy/512 113 ns 113 ns 6180930 Bytes=8k Rate=67.4712G/s
simpleSoADirectHostToFixedBufferCopy/4096 1237 ns 1235 ns 566747 Bytes=64k Rate=49.4234G/s
simpleSoADirectHostToFixedBufferCopy/32768 15540 ns 15496 ns 45228 Bytes=512k Rate=31.5094G/s
simpleSoADirectHostToFixedBufferCopy/262144 124628 ns 124264 ns 5585 Bytes=4M Rate=31.435G/s
simpleSoADirectHostToFixedBufferCopy/2097152 1440522 ns 1434445 ns 487 Bytes=32M Rate=21.7854G/s
simpleSoADirectHostToFixedBufferCopy/16777216 9900081 ns 9868170 ns 62 Bytes=256M Rate=25.334G/s
simpleSoADirectHostToFixedBufferCopy/67108864 44172606 ns 44038170 ns 14 Bytes=1024M Rate=22.7076G/s
simpleSoAOptimalHostToFixedBufferCopy/1 25.4 ns 25.4 ns 27560671 Bytes=16 Rate=600.753M/s
simpleSoAOptimalHostToFixedBufferCopy/8 25.5 ns 25.4 ns 27556719 Bytes=128 Rate=4.69248G/s
simpleSoAOptimalHostToFixedBufferCopy/64 31.3 ns 31.3 ns 22272491 Bytes=1024 Rate=30.4965G/s
simpleSoAOptimalHostToFixedBufferCopy/512 90.7 ns 90.5 ns 7724358 Bytes=8k Rate=84.263G/s
simpleSoAOptimalHostToFixedBufferCopy/4096 1212 ns 1210 ns 579650 Bytes=64k Rate=50.4524G/s
simpleSoAOptimalHostToFixedBufferCopy/32768 15656 ns 15610 ns 45481 Bytes=512k Rate=31.2805G/s
simpleSoAOptimalHostToFixedBufferCopy/262144 124211 ns 123845 ns 5603 Bytes=4M Rate=31.5414G/s
simpleSoAOptimalHostToFixedBufferCopy/2097152 1148184 ns 1144388 ns 601 Bytes=32M Rate=27.3072G/s
simpleSoAOptimalHostToFixedBufferCopy/16777216 9687522 ns 9655885 ns 63 Bytes=256M Rate=25.8909G/s
simpleSoAOptimalHostToFixedBufferCopy/67108864 47272263 ns 47123369 ns 13 Bytes=1024M Rate=21.2209G/s
simpleAoSHostToFixedBufferCopy/1 19.4 ns 19.3 ns 36222530 Bytes=16 Rate=789.511M/s
simpleAoSHostToFixedBufferCopy/8 19.4 ns 19.3 ns 36219083 Bytes=128 Rate=6.16873G/s
simpleAoSHostToFixedBufferCopy/64 23.5 ns 23.5 ns 29830658 Bytes=1024 Rate=40.6423G/s
simpleAoSHostToFixedBufferCopy/512 79.7 ns 79.5 ns 8805052 Bytes=8k Rate=95.9561G/s
simpleAoSHostToFixedBufferCopy/4096 1192 ns 1190 ns 587952 Bytes=64k Rate=51.2917G/s
simpleAoSHostToFixedBufferCopy/32768 15308 ns 15264 ns 45633 Bytes=512k Rate=31.9891G/s
simpleAoSHostToFixedBufferCopy/262144 123389 ns 123034 ns 5707 Bytes=4M Rate=31.7494G/s
simpleAoSHostToFixedBufferCopy/2097152 1143912 ns 1140204 ns 599 Bytes=32M Rate=27.4074G/s
simpleAoSHostToFixedBufferCopy/16777216 9814677 ns 9783654 ns 62 Bytes=256M Rate=25.5528G/s
simpleAoSHostToFixedBufferCopy/67108864 47313324 ns 47164521 ns 13 Bytes=1024M Rate=21.2024G/s
[bash][pcadp04]:vecmem >
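For reading the Rate column: the numbers above appear consistent with Rate = Bytes / CPU time, printed with binary prefixes (so "G/s" would mean 2^30 bytes per second). That is an inference from the printed values, not something checked against the benchmark source. A small sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second from a payload size and a measured time in nanoseconds.
double bytes_per_second(std::uint64_t bytes, double time_ns) {
    assert(time_ns > 0.0);
    return static_cast<double>(bytes) / (time_ns * 1e-9);
}

// The same rate in units of 2^30 bytes per second, which is the unit the
// Rate column seems to use (assumption based on the printed numbers).
double rate_binary_giga(std::uint64_t bytes, double time_ns) {
    return bytes_per_second(bytes, time_ns) / (1024.0 * 1024.0 * 1024.0);
}
```

For example, rate_binary_giga(4194304, 124264.0) for the simpleSoADirectHostToFixedBufferCopy/262144 row gives about 31.4, in line with the printed Rate=31.435G/s.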
I believe I understand many aspects of these results. But I'm really not sure why the copy rate drops as it does for large sizes: in all three benchmarks it peaks at 8k, and from 32M upwards it is down to roughly 21-27 G/s. :confused:
In any case, I plan to continue the investigation...