[CPU] Enable cpu_convert utility function with AVX_512 FP16
Details:
- Augmented the cpu_convert function with support for AVX512 fp16 load/store instructions.
- Added tests.
Tickets:
- Closes #21809
Hey @ceciliapeng2011 could you guide me a little bit on where and how to create the required tests for this? Please review my changes as well, thanks!
Hey @ceciliapeng2011, please review, thanks!
Thanks for the contribution, @siddhant-0707. For the unit test, you could refer to this one: https://github.com/openvinotoolkit/openvino/pull/22174/files#diff-68058ac8f7dc6ca18caffd8e7ff762035dcd7ee15f5ccf4b78856a9a50639adf
Hey, on my machine I am able to run the test (it uses AVX2):
// Using AVX2
[ RUN ] cpu_convert.AVX512_fp16_load_store
size 1000: 44 microseconds
size 10000: 4 microseconds
size 100000: 33 microseconds
size 1000000: 596 microseconds
[ OK ] cpu_convert.AVX512_fp16_load_store (5 ms)
[----------] 1 test from cpu_convert (5 ms total)
I will have to run it on an Intel Xeon to see AVX512 performance. What is the CI configuration?
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
CI probably cannot benchmark the performance across platforms. Do you have a local machine with AVX512?
No, unfortunately I don't
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
Hey @ceciliapeng2011, I finally got a machine with AVX512 capability. Here are the results I collected after changing the lines you indicated to:
constexpr size_t vlen = 16u;
constexpr size_t vlen_log2 = 4;
[ RUN ] cpu_convert.AVX512_fp16_load_store
size 1000: 645 microseconds
size 10000: 1055 microseconds
size 100000: 48 microseconds
size 1000000: 122 microseconds
size 10000000: 1398 microseconds
[ OK ] cpu_convert.AVX512_fp16_load_store (17 ms)
[ RUN ] cpu_convert.AVX512_fp16_load_store
size 1000: 674 microseconds
size 10000: 1112 microseconds
size 100000: 41 microseconds
size 1000000: 143 microseconds
size 10000000: 1293 microseconds
[ OK ] cpu_convert.AVX512_fp16_load_store (17 ms)
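For readers following the vlen change, here is a minimal sketch of how constants like these typically split a buffer into full vector blocks plus a tail. It assumes vlen is the number of elements handled per vector iteration and vlen_log2 is its base-2 logarithm. LoopSplit and split_loop are hypothetical names used for illustration only; they are not part of cpu_convert.

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the comment above.
constexpr std::size_t vlen = 16u;     // elements per vector iteration
constexpr std::size_t vlen_log2 = 4;  // log2(vlen)

// The two constants must stay in sync; this catches a mismatch at compile time.
static_assert((std::size_t{1} << vlen_log2) == vlen,
              "vlen_log2 must equal log2(vlen)");

// Hypothetical helper type: how a buffer of `size` elements is divided.
struct LoopSplit {
    std::size_t full_blocks;  // iterations that process exactly vlen elements
    std::size_t tail;         // leftover elements for a masked or scalar epilogue
};

// Hypothetical helper: shift and mask replace division and modulo
// because vlen is a power of two.
constexpr LoopSplit split_loop(std::size_t size) {
    return LoopSplit{size >> vlen_log2, size & (vlen - 1)};
}
```

With these values, split_loop(1000) gives 62 full blocks and a tail of 8 elements, while split_loop(10000000) gives 625000 full blocks and no tail.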
Great, glad you got an AVX512 machine and can continue the work! Could you please benchmark the workload with both AVX2 and AVX512 on the same machine?
Please make sure the scaling governor of your machine is set to performance (the default is powersave) before benchmarking.
You can set it with this Linux command:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
This PR will be closed in a week because of 2 weeks of no activity.
@siddhant-0707 From my perspective, this PR still needs the following two unit tests:
- cross-compare the conversion performance of AVX2 and AVX512 fp16 across different workload sizes;
- validate the output result.
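The two requested checks could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's test code: half_to_float is a portable scalar fp16-to-fp32 reference, convert_blocked is a hypothetical stand-in for the vectorized path under test (it is not the real cpu_convert), and time_us and outputs_match are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Portable scalar fp16 -> fp32 reference (IEEE 754 binary16 layout).
inline float half_to_float(std::uint16_t h) {
    const int exp = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float mag;
    if (exp == 0) {
        mag = std::ldexp(static_cast<float>(mant), -24);  // zero or subnormal
    } else if (exp == 31) {
        mag = mant ? std::numeric_limits<float>::quiet_NaN()
                   : std::numeric_limits<float>::infinity();
    } else {
        mag = std::ldexp(static_cast<float>(mant + 1024), exp - 25);  // normal
    }
    return (h & 0x8000) ? -mag : mag;
}

using Converter = void (*)(const std::uint16_t*, float*, std::size_t);

// Plain element-by-element reference path.
void convert_reference(const std::uint16_t* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = half_to_float(src[i]);
}

// Hypothetical stand-in for the path under test: blocks of 16 plus a tail.
void convert_blocked(const std::uint16_t* src, float* dst, std::size_t n) {
    constexpr std::size_t vlen = 16;
    std::size_t i = 0;
    for (; i + vlen <= n; i += vlen)
        for (std::size_t j = 0; j < vlen; ++j) dst[i + j] = half_to_float(src[i + j]);
    for (; i < n; ++i) dst[i] = half_to_float(src[i]);
}

// Runs fn once over src and returns the elapsed time in microseconds.
long long time_us(Converter fn, const std::vector<std::uint16_t>& src,
                  std::vector<float>& dst) {
    const auto t0 = std::chrono::steady_clock::now();
    fn(src.data(), dst.data(), src.size());
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

// True when both converters give bit-identical output for n deterministic inputs.
// The comparison is bitwise so that NaN results also compare equal.
bool outputs_match(Converter a, Converter b, std::size_t n) {
    std::vector<std::uint16_t> src(n);
    for (std::size_t i = 0; i < n; ++i) src[i] = static_cast<std::uint16_t>(i * 40503u);
    std::vector<float> out_a(n), out_b(n);
    a(src.data(), out_a.data(), n);
    b(src.data(), out_b.data(), n);
    return n == 0 || std::memcmp(out_a.data(), out_b.data(), n * sizeof(float)) == 0;
}
```

For the performance comparison, time_us would be called with each converter over the same input sizes on the same machine, ideally after a warm-up run and repeated several times per size, since a single run is noisy.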
Hey @siddhant-0707, will you have time to finish this PR?