Software path not available when running benchmarks
Hi :)
I'm new to this project and am trying to run some of the benchmarks provided by DML.
I've successfully configured DSA and run the hardware path. However, I cannot get the software path to run when I use the provided benchmark framework for performance tests.
- I first checked whether my server can run both the software and hardware paths. It meets the requirements for both, and the example tests finish successfully. Sample output is below.
$ sudo ./build/examples/high-level-api/hl_mem_move_example software_path
Executing using dml::software path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.
$ sudo ./build/examples/high-level-api/hl_mem_move_example hardware_path
Executing using dml::hardware path
Starting dml::mem_move example...
Copy 1KB of data from source into destination...
Finished successfully.
- The Benchmarking section of the documentation says that path:cpu is used for benchmarks running on the CPU. However, I cannot get any CPU-related results; every case in the sample output below is path:dsa.
$ sudo ./build/bin/dml_benchmarks --benchmark_filter="copy/.*/exec:sync/.*/size:4096/.*" --benchmark_min_time=0.1
== Host: XXX
== Kernel: 6.6.38-XXX
== CPU: Intel(R) Xeon(R) Gold 6454S (143)
--> Microcode: 0x2b0005c0
--> Stepping: 8
--> Logical: 64
--> Physical: 64
--> Socket: 32
--> Cluster: 8
== Accelerators: 4
--> NUMA 0: 2
--> NUMA 1: 2
== Affinity Map [device_index: thread_index(cpu_index)...]:
--> 0(32) 1(40) 2(48) 3(56) 4(33) 5(41) 6(49) 7(57)
2024-07-13T18:15:11+00:00
Running ./build/bin/dml_benchmarks
Run on (64 X 3400 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x64)
L1 Instruction 32 KiB (x64)
L2 Unified 2048 KiB (x64)
L3 Unified 61440 KiB (x2)
Load Average: 0.02, 0.05, 0.07
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 5393 ns 5394 ns 25908 Latency=5.39321us Latency/Op=337.076ns Throughput=12.1516G/s
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 7773 ns 7775 ns 20004 Latency=7.77254us Latency/Op=485.784ns Throughput=8.43174G/s
copy/api:c/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 6148 ns 6150 ns 23922 Latency=6.14786us Latency/Op=384.241ns Throughput=10.66G/s
copy/api:cpp/path:dsa/exec:sync/qsize:16/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 7457 ns 7460 ns 17897 Latency=7.45672us Latency/Op=466.045ns Throughput=8.78886G/s
copy/api:c/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 9363 ns 9364 ns 14903 Latency=9.36282us Latency/Op=292.588ns Throughput=13.9992G/s
copy/api:cpp/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 12516 ns 12526 ns 11455 Latency=12.5165us Latency/Op=391.14ns Throughput=10.4719G/s
copy/api:c/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 10308 ns 10309 ns 13652 Latency=10.308us Latency/Op=322.126ns Throughput=12.7155G/s
copy/api:cpp/path:dsa/exec:sync/qsize:32/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 12214 ns 12218 ns 11123 Latency=12.2144us Latency/Op=381.699ns Throughput=10.731G/s
copy/api:c/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 18702 ns 18711 ns 7474 Latency=18.702us Latency/Op=292.219ns Throughput=14.0169G/s
copy/api:cpp/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 23672 ns 23688 ns 5919 Latency=23.6724us Latency/Op=369.881ns Throughput=11.0738G/s
copy/api:c/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 20185 ns 20202 ns 6897 Latency=20.1849us Latency/Op=315.389ns Throughput=12.9871G/s
copy/api:cpp/path:dsa/exec:sync/qsize:64/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 22601 ns 22613 ns 5871 Latency=22.6012us Latency/Op=353.144ns Throughput=11.5987G/s
copy/api:c/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 34582 ns 34600 ns 4046 Latency=34.5817us Latency/Op=270.17ns Throughput=15.1608G/s
copy/api:cpp/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:proc/size:4096/real_time 44488 ns 44524 ns 3149 Latency=44.4884us Latency/Op=347.566ns Throughput=11.7848G/s
copy/api:c/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 38350 ns 38386 ns 3648 Latency=38.3497us Latency/Op=299.607ns Throughput=13.6712G/s
copy/api:cpp/path:dsa/exec:sync/qsize:128/bsize:0/in_mem:ram/out_mem:cc_def/timer:full/size:4096/real_time 44497 ns 44534 ns 3135 Latency=44.4972us Latency/Op=347.634ns Throughput=11.7825G/s
- I also tried to select the CPU path directly with the benchmark filter, but the run fails because the regex does not match any benchmarks.
$ sudo ./build/bin/dml_benchmarks --benchmark_filter="copy/api:c/path:cpu/exec:sync/.*" --no_hw
== Host: XXX
== Kernel: 6.6.38-XXX
== CPU: Intel(R) Xeon(R) Gold 6454S (143)
--> Microcode: 0x2b0005c0
--> Stepping: 8
--> Logical: 64
--> Physical: 64
--> Socket: 32
--> Cluster: 8
== Accelerators: 4
--> NUMA 0: 2
--> NUMA 1: 2
== Affinity Map [device_index: thread_index(cpu_index)...]:
--> 0(32) 1(40) 2(48) 3(56) 4(33) 5(41) 6(49) 7(57)
Failed to match any benchmarks against regex: copy/api:c/path:cpu/exec:sync/.*
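One check that might narrow this down is to list the registered benchmark names and count how many exist per path value. The sketch below is only an assumption-laden helper: `count_paths` is a hypothetical function (not part of DML), and the usage line assumes that dml_benchmarks forwards the standard Google Benchmark flag `--benchmark_list_tests`, which I have not confirmed for this binary.

```shell
# Hypothetical helper (not part of DML): reads benchmark names from stdin,
# one per line, and prints how many names exist for each path:<value> token.
count_paths() {
    grep -o 'path:[^/]*' | sort | uniq -c
}

# Assumed usage, only valid if the binary accepts Google Benchmark's
# --benchmark_list_tests flag:
#   sudo ./build/bin/dml_benchmarks --benchmark_list_tests=true | count_paths
```

If no path:cpu entry shows up in the counts, the CPU cases are not registered in this build at all, which would explain why the filter regex cannot match them.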
The Introduction section says that the software path is used if the hardware accelerator is not available. So I disabled DSA with accel-config and re-ran the benchmark, but the software path still doesn't run.
Sorry for the multiple questions. My goal is to run the software path on the CPU so that I can compare its performance with the hardware path (DSA).
If any further information is needed, please ask.
Thank you :)