Support AVX2 dynamic dispatch
We currently detect BMI2 instructions at runtime, but users can only benefit from AVX2 if they compile with -march=haswell. It would be nice to provide AVX2 support to users who are compiling with default options.
This issue is motivated specifically by this loop in ZSTD_copyCDictTableIntoCCtx() which was added as part of my short cache PR. Overall extDict compression speed at level 1 is 2-3% slower if that loop is compiled to SSE2 instructions vs AVX2 instructions.
There may be other functions which can be tagged for AVX2 dispatch in the future. I expect this issue would be closed after tagging ZSTD_copyCDictTableIntoCCtx(), and we can tag additional functions gradually.
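To make "tagging" concrete, here is a minimal sketch of one way it could look with GCC or Clang, assuming a per-function `target("avx2")` attribute. Every name below (`HAVE_DYNAMIC_AVX2`, `AVX2_TARGET`, `TAG_BITS`, `strip_tags*`) is hypothetical, and the loop is only a stand-in, not the actual body of ZSTD_copyCDictTableIntoCCtx(). The `__builtin_cpu_supports("avx2")` call is a GCC/Clang builtin used here for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature macros: enable the AVX2 variant only where the
 * per-function target attribute is known to be available. */
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#  define HAVE_DYNAMIC_AVX2 1
#  define AVX2_TARGET __attribute__((target("avx2")))
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define HAVE_DYNAMIC_AVX2 0
#  define FORCE_INLINE static inline
#endif

#define TAG_BITS 8 /* hypothetical shift width, for illustration only */

/* Shared loop body. It is inlined into each wrapper below, so the
 * compiler vectorizes it with that wrapper's instruction set. */
FORCE_INLINE void strip_tags_body(uint32_t* dst, const uint32_t* src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i] >> TAG_BITS;
    }
}

/* Baseline variant: built with the default flags (SSE2 on x86-64). */
static void strip_tags_default(uint32_t* dst, const uint32_t* src, size_t n)
{
    strip_tags_body(dst, src, n);
}

#if HAVE_DYNAMIC_AVX2
/* Tagged variant: the compiler may use AVX2 inside this function only. */
static AVX2_TARGET void strip_tags_avx2(uint32_t* dst, const uint32_t* src, size_t n)
{
    strip_tags_body(dst, src, n);
}
#endif

/* Dispatcher: picks the variant at runtime. */
static void strip_tags(uint32_t* dst, const uint32_t* src, size_t n)
{
#if HAVE_DYNAMIC_AVX2
    if (__builtin_cpu_supports("avx2")) {
        strip_tags_avx2(dst, src, n);
        return;
    }
#endif
    strip_tags_default(dst, src, n);
}
```

In practice the detection result would probably be cached once per context rather than queried on every call, similar to how the BMI2 flag is handled today.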
I already researched how we can safely detect AVX2 at runtime: https://stackoverflow.com/questions/72522885/are-the-xgetbv-and-cpuid-checks-sufficient-to-guarantee-avx2-support
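For reference, a minimal sketch of the CPUID + XGETBV check, assuming GCC or Clang on x86 and the `<cpuid.h>` helpers they ship. The function names `read_xcr0` and `cpu_supports_avx2` are hypothetical, and this has not been validated against every environment discussed in the linked question:

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

/* Read XCR0 via XGETBV with ECX = 0. The instruction is emitted as raw
 * bytes so the file does not need to be compiled with -mxsave.
 * Only call this after confirming the OSXSAVE bit in CPUID. */
static uint64_t read_xcr0(void)
{
    uint32_t eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xd0" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
}

/* Returns 1 if AVX2 can be used safely, 0 otherwise. */
static int cpu_supports_avx2(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.1:ECX bit 27 = OSXSAVE, bit 28 = AVX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28))) return 0;

    /* The OS must have enabled SSE (bit 1) and AVX (bit 2) state in XCR0,
     * otherwise the YMM registers are not saved across context switches. */
    if ((read_xcr0() & 0x6) != 0x6) return 0;

    /* CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return (int)((ebx >> 5) & 1u);
}
#else
/* Non-x86 targets or unknown compilers: report no AVX2. */
static int cpu_supports_avx2(void) { return 0; }
#endif
```

The order matters: XGETBV faults if the OS has not set OSXSAVE, so the CPUID leaf 1 check has to come first.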
@ValZapod At least on Linux, x86 feature levels have become a thing. Some distributions, such as CachyOS, already offer repositories compiled for x86-64-v3, which is very close to -march=haswell.