
AoS / SoA Copy Benchmarks, main branch (2024.09.27.)

Open · krasznaa opened this issue 5 months ago • 1 comment

While working on https://github.com/acts-project/traccc/pull/712 yesterday, I was surprised to see how expensive it apparently is to copy a few megabytes of cell information from one host location to another. In Nsight Systems, the host-to-host vecmem::edm::host -> vecmem::edm::buffer copy took about as long as the subsequent host-to-device vecmem::edm::buffer -> vecmem::edm::buffer copy.

As a reminder, the point of the host-to-host step is that the entire payload of an SoA container can then be copied in one step, instead of column by column.

But as it turns out, the overhead of copying a cell collection in 5 steps instead of one (a traccc cell has only 5 variables) is negligible compared to how long it takes to copy a few megabytes from one place to another in host memory. :confused:
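To make the two strategies concrete, here is a minimal sketch in plain C++. It does not use the vecmem copy API; the five 32-bit columns, the back-to-back payload layout, and the function names `copy_payload_once` and `copy_by_column` are all assumptions made for illustration, not the actual traccc cell layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical SoA payload: kColumns columns of n 32-bit values each,
// stored back-to-back in one contiguous allocation of kColumns * n values.
constexpr std::size_t kColumns = 5;

// Strategy 1: copy the whole contiguous payload with a single memcpy.
void copy_payload_once(const std::vector<std::uint32_t>& src,
                       std::vector<std::uint32_t>& dst) {
    assert(src.size() == dst.size());
    if (src.empty()) {
        return;
    }
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(std::uint32_t));
}

// Strategy 2: copy column by column, i.e. kColumns separate memcpy calls.
void copy_by_column(const std::vector<std::uint32_t>& src,
                    std::vector<std::uint32_t>& dst) {
    assert(src.size() == dst.size());
    assert(src.size() % kColumns == 0);
    if (src.empty()) {
        return;
    }
    const std::size_t n = src.size() / kColumns;  // elements per column
    for (std::size_t c = 0; c < kColumns; ++c) {
        std::memcpy(dst.data() + c * n, src.data() + c * n,
                    n * sizeof(std::uint32_t));
    }
}
```

Both functions move exactly the same number of bytes; they differ only in the number of calls (1 versus 5), which is why the per-call overhead stops mattering once the payload is a few megabytes.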

So in this PR I want to see exactly how copies of the same sort of EDM compare in AoS and in SoA form. Right now, with only the host copies implemented, I get:

[bash][pcadp04]:vecmem > ./build/bin/vecmem_benchmark_core --benchmark_filter="AoS|SoA"
2024-09-27T11:02:20+02:00
Running ./build/bin/vecmem_benchmark_core
Run on (48 X 1796.81 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 0.06, 0.07, 0.21
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
simpleSoADirectHostToFixedBufferCopy/1               47.0 ns         46.9 ns     14915255 Bytes=16 Rate=325.124M/s
simpleSoADirectHostToFixedBufferCopy/8               44.8 ns         44.7 ns     15652590 Bytes=128 Rate=2.66556G/s
simpleSoADirectHostToFixedBufferCopy/64              48.8 ns         48.7 ns     14353059 Bytes=1024 Rate=19.5998G/s
simpleSoADirectHostToFixedBufferCopy/512              113 ns          113 ns      6180930 Bytes=8k Rate=67.4712G/s
simpleSoADirectHostToFixedBufferCopy/4096            1237 ns         1235 ns       566747 Bytes=64k Rate=49.4234G/s
simpleSoADirectHostToFixedBufferCopy/32768          15540 ns        15496 ns        45228 Bytes=512k Rate=31.5094G/s
simpleSoADirectHostToFixedBufferCopy/262144        124628 ns       124264 ns         5585 Bytes=4M Rate=31.435G/s
simpleSoADirectHostToFixedBufferCopy/2097152      1440522 ns      1434445 ns          487 Bytes=32M Rate=21.7854G/s
simpleSoADirectHostToFixedBufferCopy/16777216     9900081 ns      9868170 ns           62 Bytes=256M Rate=25.334G/s
simpleSoADirectHostToFixedBufferCopy/67108864    44172606 ns     44038170 ns           14 Bytes=1024M Rate=22.7076G/s
simpleSoAOptimalHostToFixedBufferCopy/1              25.4 ns         25.4 ns     27560671 Bytes=16 Rate=600.753M/s
simpleSoAOptimalHostToFixedBufferCopy/8              25.5 ns         25.4 ns     27556719 Bytes=128 Rate=4.69248G/s
simpleSoAOptimalHostToFixedBufferCopy/64             31.3 ns         31.3 ns     22272491 Bytes=1024 Rate=30.4965G/s
simpleSoAOptimalHostToFixedBufferCopy/512            90.7 ns         90.5 ns      7724358 Bytes=8k Rate=84.263G/s
simpleSoAOptimalHostToFixedBufferCopy/4096           1212 ns         1210 ns       579650 Bytes=64k Rate=50.4524G/s
simpleSoAOptimalHostToFixedBufferCopy/32768         15656 ns        15610 ns        45481 Bytes=512k Rate=31.2805G/s
simpleSoAOptimalHostToFixedBufferCopy/262144       124211 ns       123845 ns         5603 Bytes=4M Rate=31.5414G/s
simpleSoAOptimalHostToFixedBufferCopy/2097152     1148184 ns      1144388 ns          601 Bytes=32M Rate=27.3072G/s
simpleSoAOptimalHostToFixedBufferCopy/16777216    9687522 ns      9655885 ns           63 Bytes=256M Rate=25.8909G/s
simpleSoAOptimalHostToFixedBufferCopy/67108864   47272263 ns     47123369 ns           13 Bytes=1024M Rate=21.2209G/s
simpleAoSHostToFixedBufferCopy/1                     19.4 ns         19.3 ns     36222530 Bytes=16 Rate=789.511M/s
simpleAoSHostToFixedBufferCopy/8                     19.4 ns         19.3 ns     36219083 Bytes=128 Rate=6.16873G/s
simpleAoSHostToFixedBufferCopy/64                    23.5 ns         23.5 ns     29830658 Bytes=1024 Rate=40.6423G/s
simpleAoSHostToFixedBufferCopy/512                   79.7 ns         79.5 ns      8805052 Bytes=8k Rate=95.9561G/s
simpleAoSHostToFixedBufferCopy/4096                  1192 ns         1190 ns       587952 Bytes=64k Rate=51.2917G/s
simpleAoSHostToFixedBufferCopy/32768                15308 ns        15264 ns        45633 Bytes=512k Rate=31.9891G/s
simpleAoSHostToFixedBufferCopy/262144              123389 ns       123034 ns         5707 Bytes=4M Rate=31.7494G/s
simpleAoSHostToFixedBufferCopy/2097152            1143912 ns      1140204 ns          599 Bytes=32M Rate=27.4074G/s
simpleAoSHostToFixedBufferCopy/16777216           9814677 ns      9783654 ns           62 Bytes=256M Rate=25.5528G/s
simpleAoSHostToFixedBufferCopy/67108864          47313324 ns     47164521 ns           13 Bytes=1024M Rate=21.2024G/s
[bash][pcadp04]:vecmem >

Many aspects of these results I believe I understand. But I'm really not sure why the copy speed drops as it does for large sizes. :confused:
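As a sanity check on how to read the table, the Rate column can be recomputed from the Bytes and CPU columns. The sketch below assumes that the counters use 1024-based prefixes (so "1024M" means 2^30 bytes and "G/s" means 2^30 bytes per second), which is consistent with the printed numbers; `rate_gib_per_s` is a hypothetical helper, not part of vecmem or of the benchmark library.

```cpp
#include <cassert>
#include <cstdint>

// Recompute a copy rate in units of 2^30 bytes per second from the number
// of bytes copied per iteration and the CPU time per iteration (in ns).
double rate_gib_per_s(std::uint64_t bytes, double cpu_time_ns) {
    assert(cpu_time_ns > 0.0);
    const double bytes_per_s =
        static_cast<double>(bytes) / (cpu_time_ns * 1e-9);
    return bytes_per_s / (1024.0 * 1024.0 * 1024.0);
}
```

For the largest SoA "direct" row (67108864 elements of 16 bytes each, 44038170 ns of CPU time), `rate_gib_per_s(67108864ULL * 16ULL, 44038170.0)` comes out at about 22.71, matching the reported Rate=22.7076G/s.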

In any case, I plan to continue the investigation...

krasznaa, Sep 27 '24 09:09