RFC: renumber Arm targets + Apple feature detection
FYI we are working on supporting dynamic dispatch with Clang on Arm. As part of this, we may insert another NEON target using some of the optional features (fp16, bf16, dot, perhaps fp16fml - please let us know if you'd like to use/target others).
We'd want this target to be used if it's available, but it should not take precedence over any SVE targets. To enable that, we'd have to renumber the Arm targets. This could cause breakage for a project that uses the combination of:
- GCC on aarch64
- dynamic dispatch via foreach_target.h
- precompiled shared libraries or objects, which are not compiled fresh during a build.
This combination seems sufficiently unlikely, but please let us know within, say, a week if you have any concerns.
For concreteness, the plan is to insert 2 targets below HWY_NEON, 3 below HWY_SVE2, and that leaves 4 below HWY_SVE2_128.
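To illustrate why renumbering can break precompiled objects, here is a minimal sketch. It assumes that targets are single bits in a mask and that a lower bit position means a more preferred target. All constants and the `BestTarget` helper are hypothetical and are not Highway's real values or API.

```cpp
#include <cstdint>

// Hypothetical bit positions, NOT Highway's real values.
// Assumption: a lower bit position means a more preferred target.
// Old numbering:
constexpr int64_t kOldSve2 = int64_t{1} << 1;
constexpr int64_t kOldSve = int64_t{1} << 2;
constexpr int64_t kOldNeon = int64_t{1} << 3;

// New numbering: a new NEON target is inserted between SVE and plain NEON.
constexpr int64_t kNewSve2 = int64_t{1} << 1;
constexpr int64_t kNewSve = int64_t{1} << 2;
constexpr int64_t kNewNeonExt = int64_t{1} << 3;  // hypothetical new target
constexpr int64_t kNewNeon = int64_t{1} << 4;

// Returns the most preferred target (lowest set bit), or 0 if none is set.
constexpr int64_t BestTarget(int64_t supported) {
  const uint64_t u = static_cast<uint64_t>(supported);
  return static_cast<int64_t>(u & (~u + 1));
}
```

In this sketch `kOldNeon` and `kNewNeonExt` share the same bit, so an object compiled against the old numbering and linked with code using the new numbering would disagree about what that bit means. Fresh builds are unaffected.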
Here is a function that can detect whether an optional CPU feature is present on macOS/iOS/iPadOS:
static HWY_INLINE bool HasCpuFeature(const char* feature_name) {
  int result = 0;
  size_t len = sizeof(int);
  return sysctlbyname(feature_name, &result, &len, nullptr, 0) == 0 &&
         result != 0;
}
The <sys/sysctl.h> header must be included to use the sysctlbyname function on macOS/iOS/iPadOS.
A list of optional AArch64 SIMD ISA extensions that can be queried on MacOS/iOS/iPad can be found at https://developer.apple.com/documentation/kernel/1387446-sysctlbyname/determining_instruction_set_characteristics.
Thanks @johnplatts - good point, seems like a good occasion to also add support for runtime dispatch on Apple. I think the ones we'd look at are:
- NEON: AdvSIMD_HPFPCvt, FEAT_AES+FEAT_PMULL;
- NEON2 or NEON_8_6 (or any better ideas for the name?): FEAT_BF16, FEAT_DotProd, FEAT_FHM/FEAT_FP16.
AVX3/AVX3_DL target detection should also be updated for x86_64 on macOS, for two reasons: (a) XGETBV might fail to report support for ZMM vectors and AVX3 mask registers on macOS until AVX512 instructions are invoked, even when both the CPU and OS support AVX512F; and (b) macOS releases earlier than 12.2 have bugs in AVX3/AVX3_DL context saving.
Here are some functions that can be used to check that Highway is running on macOS 12.2 or later (the code below requires that <sys/utsname.h> be included on macOS):
static HWY_INLINE bool ParseU32(const char*& ptr, uint32_t& parsed_val) {
  uint64_t parsed_u64 = 0;
  const char* start_ptr = ptr;
  for (char ch; (ch = (*ptr)) != '\0'; ++ptr) {
    unsigned digit = static_cast<unsigned char>(ch) -
                     static_cast<unsigned>(static_cast<unsigned char>('0'));
    if (digit > 9) {
      break;
    }
    parsed_u64 = (parsed_u64 * 10u) + digit;
    if (parsed_u64 > 0xFFFFFFFFu) {
      return false;
    }
  }
  parsed_val = static_cast<uint32_t>(parsed_u64);
  return (ptr != start_ptr);
}

static HWY_INLINE bool IsMacOS_12_2_Or_Later() {
  struct utsname uname_buf;
  ZeroBytes(uname_buf);
  if ((uname(&uname_buf)) != 0) {
    return false;
  }
  const char* ptr = uname_buf.release;
  if (!ptr) {
    return false;
  }
  uint32_t major;
  uint32_t minor;
  if (!ParseU32(ptr, major)) {
    return false;
  }
  if (*ptr != '.') {
    return false;
  }
  ++ptr;
  if (!ParseU32(ptr, minor)) {
    return false;
  }
  // We are running on macOS 12.2 or later if the Darwin kernel version is
  // 21.3 or later
  return (major > 21 || (major == 21 && minor >= 3));
}
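The version comparison can be tried out without calling uname by passing the release string directly. The following is a minimal sketch with a hypothetical helper name. It uses sscanf in place of the hand-written ParseU32 above, so it lacks that function's overflow check.

```cpp
#include <cstdio>

// Hypothetical helper (not part of Highway): returns true if a Darwin release
// string such as "21.3.0" is at least min_major.min_minor.
inline bool DarwinReleaseAtLeast(const char* release, unsigned min_major,
                                 unsigned min_minor) {
  if (release == nullptr) return false;
  unsigned major_ver = 0;
  unsigned minor_ver = 0;
  // sscanf stands in for ParseU32; it does not guard against overflow, which
  // is acceptable only in a sketch.
  if (std::sscanf(release, "%u.%u", &major_ver, &minor_ver) != 2) {
    return false;
  }
  return major_ver > min_major ||
         (major_ver == min_major && minor_ver >= min_minor);
}
```

With the mapping used above (macOS 12.2 corresponds to Darwin 21.3), the check would be `DarwinReleaseAtLeast(uname_buf.release, 21, 3)`.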
Here is an updated snippet that correctly checks for AVX3 support on MacOS:
if (has_xsave && has_osxsave) {
#ifdef __APPLE__
  // On macOS, check for AVX3 support by checking that we are running on
  // macOS 12.2 or later and HasCpuFeature("hw.optional.avx512f") returns true
  const bool have_avx3_xsave_support =
      IsMacOS_12_2_Or_Later() && HasCpuFeature("hw.optional.avx512f");
#endif
  const uint32_t xcr0 = ReadXCR0();
  constexpr int64_t min_avx3 = HWY_AVX3 | HWY_AVX3_DL | HWY_AVX3_SPR;
  // XMM/YMM
  if (!IsBitSet(xcr0, 1) || !IsBitSet(xcr0, 2)) {
    // Clear the AVX2/AVX3 bits if XMM/YMM XSAVE is not enabled
    bits &= ~min_avx2;
  }
#ifndef __APPLE__
  // On OSes other than macOS, check for AVX3 support by checking that bits 5,
  // 6, and 7 of XCR0 are set
  const bool have_avx3_xsave_support =
      IsBitSet(xcr0, 5) && IsBitSet(xcr0, 6) && IsBitSet(xcr0, 7);
#endif
  // opmask, ZMM lo/hi
  if (!have_avx3_xsave_support) {
    bits &= ~min_avx3;
  }
} else {  // !has_xsave || !has_osxsave
  // Clear the AVX2/AVX3 bits if the CPU or OS does not support XSAVE
  bits &= ~min_avx2;
}
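For reference, the XCR0 bit tests used on non-Apple platforms can be isolated into pure functions of the register value. This is a sketch only: the function names are hypothetical, and `IsBitSet` is re-declared here just to keep the example self-contained.

```cpp
#include <cstdint>

// Local stand-in for the IsBitSet helper used in the snippet above.
constexpr bool IsBitSet(uint32_t reg, int index) {
  return (reg & (1U << index)) != 0;
}

// Bits 1 (XMM) and 2 (YMM) must be enabled for AVX2 state to be saved.
constexpr bool OsSavesAvx2State(uint32_t xcr0) {
  return IsBitSet(xcr0, 1) && IsBitSet(xcr0, 2);
}

// Bits 5 (opmask), 6 (ZMM lo) and 7 (ZMM hi) are additionally required for
// AVX3 state. As noted above, this test is not reliable on macOS.
constexpr bool OsSavesAvx3State(uint32_t xcr0) {
  return OsSavesAvx2State(xcr0) && IsBitSet(xcr0, 5) && IsBitSet(xcr0, 6) &&
         IsBitSet(xcr0, 7);
}
```

For example, an XCR0 value of 0xE7 has all of these bits set, while 0x07 has only the XMM/YMM bits set.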
The MacOS AVX3 context saving bug was mentioned at https://community.intel.com/t5/Software-Tuning-Performance/MacOS-Darwin-kernel-bug-clobbers-AVX-512-opmask-register-state/m-p/1327259, https://github.com/golang/go/issues/49233, and https://github.com/simdutf/simdutf/pull/236.
Nice find, thank you @johnplatts ! Would you like to send this code as a pull request, with a comment mentioning the intel.com forum discussion link?
I have made the changes to x86 DetectTargets() that fix the issues with AVX3 detection on macOS in pull request #2083.
Pull request #2083 also adds HasCpuFeature to hwy/targets.cc, available when Highway is compiled for macOS/iOS/iPadOS. The updated x86 implementation of DetectTargets() on macOS uses it to check that the OS supports AVX3, and it can also be used to detect some of the AArch64 SIMD extensions on Apple Silicon CPUs.
Windows on AArch64 also has the IsProcessorFeaturePresent function that can check for the presence of some of the AArch64 instruction set extensions (including the SDOT/UDOT instructions). The function is described at https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-isprocessorfeaturepresent.
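A minimal sketch of such a check for the dot product instructions follows. The helper name is hypothetical, and the sketch assumes that the Windows SDK in use defines PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE. When the constant is missing, or on other platforms, it conservatively reports false.

```cpp
#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical helper (not part of Highway): reports whether the SDOT/UDOT
// instructions are available according to IsProcessorFeaturePresent.
inline bool HasArmDotProdOnWindows() {
#if defined(_WIN32) && defined(PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE)
  return IsProcessorFeaturePresent(PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE) != 0;
#else
  // Not Windows, or the SDK headers predate this constant.
  return false;
#endif
}
```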
Windows on AArch64 also has the IsProcessorFeaturePresent function that can check for the presence of some of the AArch64 instruction set extensions
Unfortunately, that doesn't cover SVE. Any code with SVE intrinsics cannot be used on Windows targets, see: https://github.com/llvm/llvm-project/issues/64278#issuecomment-2081649539
Microsoft is likely planning to add SVE support in a future Windows release, as it has recently added SVE detection for Windows on AArch64 to the .NET Runtime in the pull request at https://github.com/dotnet/runtime/pull/100937.
A new constant for the AArch64 SVE feature, PF_ARM_SVE_INSTRUCTIONS_AVAILABLE, was recently added to https://github.com/dotnet/runtime/blob/main/src/native/minipal/cpufeatures.c, but it has not yet made its way into the Windows headers or the IsProcessorFeaturePresent API documentation.
The Visual C++ 2022 compiler also does not currently have support for SVE, and compiling the SVE target for Windows on AArch64 requires Clang.
The renumbering is done, and thanks @johnplatts for adding the Apple detection :)