Add AVX-512 support
- Use 256-bit registers.
- Use masked loads where possible.
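To make the masked-load idea concrete, here is a minimal sketch of how a tail mask for a 256-bit, byte-wise load could be computed. The names `tail_mask` and `masked_load_emulated` are hypothetical and not part of this crate. The second function is only a portable stand-in for a zero-masking load; a real AVX-512BW + AVX-512VL path would presumably hand the mask to an intrinsic such as `_mm256_maskz_loadu_epi8` instead.

```rust
/// Number of byte lanes in a 256-bit register.
pub const LANES: usize = 32;

/// Build a 32-bit lane mask with the lowest `remaining` bits set.
/// Hypothetical helper: `remaining` is the number of valid bytes left
/// in the input, and anything at or above `LANES` selects every lane.
pub fn tail_mask(remaining: usize) -> u32 {
    if remaining >= LANES {
        u32::MAX
    } else {
        // Safe from overflow because `remaining` < 32 here.
        (1u32 << remaining) - 1
    }
}

/// Portable emulation of a zero-masking load (assumption: this only
/// mirrors the semantics, it is not the SIMD instruction itself).
/// Lanes whose mask bit is clear are zeroed and `src` is not read there,
/// so the caller must ensure every set bit indexes a valid byte of `src`.
pub fn masked_load_emulated(src: &[u8], mask: u32) -> [u8; LANES] {
    let mut out = [0u8; LANES];
    for (i, slot) in out.iter_mut().enumerate() {
        if mask & (1 << i) != 0 {
            *slot = src[i];
        }
    }
    out
}
```

For a 5-byte tail, `tail_mask(5)` sets only the low five bits, so the load touches exactly the five valid bytes and zero-fills the rest. That avoids both a scalar tail loop and reading past the end of the buffer.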
AVX-512 intrinsics are currently nightly-only, and the speedup potential is unclear. Furthermore, AVX throttling needs to be taken into consideration.
Throttling is a concern, but presumably only with wide (512-bit) registers, as @travisdowns explained well in his answer. Stick with 256-bit and you'll be fine (in this instance, no heavy instructions are involved).
Newer client chips (e.g. Ice Lake, Tiger Lake and Rocket Lake) work a bit differently (the heavy vs. light distinction disappears and only width seems to matter), but regardless, Daniel's advice still applies: you shouldn't see license-based throttling with 256-bit ops.
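As a rough sketch of what "stick with 256-bit" could look like at dispatch time, the snippet below picks a code path at runtime and never selects a 512-bit one. The `SimdPath` enum and both function names are hypothetical, not taken from this crate. It assumes the AVX-512 path would need AVX-512BW and AVX-512VL, since those are the extensions that expose byte-wise masked operations on 256-bit registers.

```rust
/// Hypothetical set of code paths a dispatcher could choose from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SimdPath {
    /// AVX-512BW + AVX-512VL instructions on 256-bit (ymm) registers.
    Avx512Vl256,
    /// Plain AVX2 on 256-bit registers.
    Avx2,
    /// No SIMD; portable fallback.
    Scalar,
}

/// Pick the best available path using runtime CPU feature detection.
/// The detection macro only exists on x86 targets, hence the `cfg`.
pub fn select_path() -> SimdPath {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512bw") && is_x86_feature_detected!("avx512vl") {
            return SimdPath::Avx512Vl256;
        }
        if is_x86_feature_detected!("avx2") {
            return SimdPath::Avx2;
        }
    }
    SimdPath::Scalar
}

/// Vector width in bits for a path, or `None` for the scalar fallback.
/// No variant maps to 512, which is the point of the sketch.
pub fn vector_width_bits(path: SimdPath) -> Option<u32> {
    match path {
        SimdPath::Avx512Vl256 | SimdPath::Avx2 => Some(256),
        SimdPath::Scalar => None,
    }
}
```

Calling `select_path()` once at startup and caching the result would keep the feature checks out of the hot loop. Whether the `Avx512Vl256` path is actually faster than `Avx2` remains the open question raised above and would need benchmarking.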