[CPU] Enable cpu_convert utility function with AVX_512 FP16

Open siddhant-0707 opened this issue 1 year ago • 8 comments

Details:

  • Augmented the cpu_convert function with support for AVX512 FP16 load/store instructions.
  • Added tests.

Tickets:

  • Closes #21809

siddhant-0707 avatar Jan 10 '24 18:01 siddhant-0707

Hey @ceciliapeng2011 could you guide me a little bit on where and how to create the required tests for this? Please review my changes as well, thanks!

siddhant-0707 avatar Jan 10 '24 18:01 siddhant-0707

Hey @ceciliapeng2011, please review, thanks!

siddhant-0707 avatar Jan 18 '24 08:01 siddhant-0707

Thanks for the contribution, @siddhant-0707. For the unit test, you could refer to this one: https://github.com/openvinotoolkit/openvino/pull/22174/files#diff-68058ac8f7dc6ca18caffd8e7ff762035dcd7ee15f5ccf4b78856a9a50639adf

ceciliapeng2011 avatar Jan 19 '24 07:01 ceciliapeng2011

Hey, I am able to run the test on my machine, where it uses AVX2:

// Using AVX2
[ RUN      ] cpu_convert.AVX512_fp16_load_store
size 1000: 44 microseconds
size 10000: 4 microseconds
size 100000: 33 microseconds
size 1000000: 596 microseconds
[       OK ] cpu_convert.AVX512_fp16_load_store (5 ms)
[----------] 1 test from cpu_convert (5 ms total)

I will have to run it on an Intel Xeon to see the AVX512 performance. What is the CI configuration?

siddhant-0707 avatar Jan 27 '24 09:01 siddhant-0707

This PR will be closed in a week because of 2 weeks of no activity.

github-actions[bot] avatar Feb 12 '24 00:02 github-actions[bot]

This PR was closed because it has been stalled for 2 week with no activity.

github-actions[bot] avatar Feb 19 '24 00:02 github-actions[bot]

CI probably cannot benchmark the performance across platforms. Do you have a local machine with AVX512?

ceciliapeng2011 avatar Feb 22 '24 01:02 ceciliapeng2011

No, unfortunately I don't

siddhant-0707 avatar Feb 22 '24 05:02 siddhant-0707

This PR will be closed in a week because of 2 weeks of no activity.

github-actions[bot] avatar Mar 08 '24 00:03 github-actions[bot]

This PR was closed because it has been stalled for 2 week with no activity.

github-actions[bot] avatar Mar 16 '24 00:03 github-actions[bot]

Hey @ceciliapeng2011, I finally got a machine with AVX512 capability. Here are the results I collected after changing the lines you indicated to the following:

constexpr size_t vlen = 16u;
constexpr size_t vlen_log2 = 4;
[ RUN      ] cpu_convert.AVX512_fp16_load_store
size 1000: 645 microseconds
size 10000: 1055 microseconds
size 100000: 48 microseconds
size 1000000: 122 microseconds
size 10000000: 1398 microseconds
[       OK ] cpu_convert.AVX512_fp16_load_store (17 ms)


[ RUN      ] cpu_convert.AVX512_fp16_load_store
size 1000: 674 microseconds
size 10000: 1112 microseconds
size 100000: 41 microseconds
size 1000000: 143 microseconds
size 10000000: 1293 microseconds
[       OK ] cpu_convert.AVX512_fp16_load_store (17 ms)
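For readers following along, here is a minimal sketch of what the two constants above typically control. It assumes vlen is the number of elements processed per vector iteration (a 512-bit register holds 16 fp32 values) and that vlen_log2 is used to split a buffer into full-vector iterations plus a scalar tail. The helper split_loop and the LoopSplit struct are hypothetical names for illustration, not code from this PR.

```cpp
#include <cassert>
#include <cstddef>

// Values quoted in the comment above.
constexpr std::size_t vlen = 16u;
constexpr std::size_t vlen_log2 = 4;
static_assert((std::size_t{1} << vlen_log2) == vlen, "vlen_log2 must match vlen");

// Hypothetical result type: how a buffer of `size` elements would be divided.
struct LoopSplit {
    std::size_t full_vectors;  // iterations that process vlen elements each
    std::size_t tail;          // leftover elements handled outside the vector loop
};

// Hypothetical helper: shift instead of divide, mask instead of modulo.
// Both are valid only because vlen is a power of two.
constexpr LoopSplit split_loop(std::size_t size) {
    return LoopSplit{size >> vlen_log2, size & (vlen - 1)};
}
```

Under these assumptions, the benchmark size 1000 gives 62 full vector iterations and a tail of 8 elements, while 10000000 divides evenly into 625000 iterations.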

siddhant-0707 avatar Mar 19 '24 10:03 siddhant-0707

Glad you have an AVX512 machine and are continuing the work, great! Could you please benchmark the workload with both AVX2 and AVX512 on the same machine?

Please make sure the CPU scaling governor on your machine is set to performance (the default is powersave) before benchmarking. On Linux you can set it with: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
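A single timed call per size is sensitive to cold caches and clock ramp-up, which may explain some of the non-monotonic numbers earlier in the thread. Below is a minimal sketch of a timing helper that could be used for the same-machine comparison. The name median_runtime_us and its default warm-up and run counts are assumptions for illustration, not part of the PR or of OpenVINO.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: call `fn` a few times untimed, then time `runs` calls
// and return the median duration in microseconds.
template <typename Fn>
double median_runtime_us(Fn&& fn, std::size_t warmup = 3, std::size_t runs = 11) {
    assert(runs > 0);
    for (std::size_t i = 0; i < warmup; ++i) {
        fn();  // untimed: warms caches and gives the core clock time to ramp up
    }
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::micro>(stop - start).count());
    }
    // nth_element puts the median (upper median for an even count) at `mid`.
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

The idea would be to wrap the conversion call for each buffer size in a lambda, run it once through the AVX2 path and once through the AVX512 path, and compare the two medians. How to force one path or the other inside cpu_convert is not shown in this thread, so that part is left open.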

ceciliapeng2011 avatar Mar 24 '24 06:03 ceciliapeng2011

This PR will be closed in a week because of 2 weeks of no activity.

github-actions[bot] avatar Apr 12 '24 00:04 github-actions[bot]

@siddhant-0707 From my perspective, this PR still needs the following two unit tests:

  1. Cross-compare the performance numbers of converting with AVX2 and AVX512 FP16 across different workloads.
  2. Validate the output result.
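For the second item, one possible shape of the check is a comparison against a scalar reference. The sketch below assumes the direction under test is fp16 to fp32, which is an exact widening, so results can be compared bit for bit (NaN is only checked for being NaN). The names half_bits_to_float and matches_reference are hypothetical and are not the PR's or OpenVINO's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar IEEE 754 binary16 -> binary32 conversion, used as the reference.
inline float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exp = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0x1Fu) {            // infinity or NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {         // normal number: rebias exponent from 15 to 127
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {        // signed zero
        bits = sign;
    } else {                       // subnormal half: normalize the mantissa
        exp = 113u;
        while ((mant & 0x400u) == 0) {
            mant <<= 1;
            --exp;
        }
        bits = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
    }
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Hypothetical check: `actual` is the output of the path under test for input `src`.
inline bool matches_reference(const std::vector<std::uint16_t>& src,
                              const std::vector<float>& actual) {
    if (src.size() != actual.size()) return false;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const float expected = half_bits_to_float(src[i]);
        if (std::isnan(expected)) {
            if (!std::isnan(actual[i])) return false;  // NaN payload is not compared
        } else if (std::memcmp(&expected, &actual[i], sizeof(float)) != 0) {
            return false;                              // bitwise, so -0 differs from +0
        }
    }
    return true;
}
```

A test could fill src with a mix of normal values, subnormals, zeros, infinities, and NaN, run the vectorized conversion into actual, and assert that matches_reference returns true for sizes that are and are not multiples of the vector length.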

ceciliapeng2011 avatar Apr 19 '24 03:04 ceciliapeng2011

Hey @siddhant-0707, will you have time to finish this PR?

mlukasze avatar Jun 18 '24 08:06 mlukasze