memchr
memchr copied to clipboard
simd: support AVX2 detection without std
Summary
This PR implements AVX2 runtime feature detection that works in no_std
using the cpuid
instruction.
Motivation
memchr
provides low-level primitives, and it's unfortunate that we cannot make use of the most performant implementation without bringing in libstd
. And there's no good reason for it, either — unlike other architectures, detecting CPU features on x86 isn't OS dependent (see this comment).
The proper solution would be to wait until is_x86_feature_detected
is available in libcore
, but that probably won't happen any time soon. In the meantime, I'd like to propose doing it ourselves as a stopgap measure.
Drawbacks
-
This adds an amount of low-level complexity to the crate. Getting information from
cpuid
isn't the simplest thing and involves a lot of details to get right. However, compared to writing complex SIMD routines by hand, I believe it should be relatively manageable. -
The maintenance burden should be neglectable, since we only care about one feature and the CPUID "protocol" can never change, no reason to touch it in the future.
Rationale and alternatives
-
Not doing this. The real life use cases for the
no_std
+x86_64
+performance critical
combo aren't the most plentiful. I'm not sure how many people would actually have a need for this, besides "it's nice to have". Boils down to cost-benefit analysis. -
Use another crate to do the detection for us.
- This brings in extra dependencies.
-
raw-cpuid
supports everything related tocpuid
, and it's quite large as a result. Is the extra compile time worth it? -
cpufeatures
is much more lightweight, but it's detection logic is actually faulty — same issue as gcc#85100, glibc#13007, etc.
-
Fall back to
#[cfg(target_feature)]
compile-time detection forno_std
.- This needs to be enabled manually for the end users, but requires the least amount of effort to implement.
- The Steam Hardware Survey shows that 87% of Steam users have AVX2 support. While this survery is biased for higher-end gaming PCs, I think it's still fair to say that AVX2 is fairly common nowadays. It makes sense for people to enable it unconditionally, depending on their target demographics.
Prior art
-
getrandom
usescpuid
to detectrdrand
. -
rdrand
also do this when thestd
feature is disabled. - Most crates under RustCrypto are
no_std
, which use the previously mentionedcpufeatures
crate for feature detection. -
simdutf8 does
target_feature
compile-time detection inno_std
.
Thanks for the PR and the very thorough analysis! I'm definitely a bit hesitant to commit to something like this, in part because it's hard to know whether the code written for the CPUID check is correct or not without spending the time to go and understand everything myself. (Which I don't have time for right now.)
The proper solution would be to wait until
is_x86_feature_detected
is available inlibcore
, but that probably won't happen any time soon. In the meantime, I'd like to propose doing it ourselves as a stopgap measure.
Could you say what this is a stop gap for? In particular, you later say:
- The real life use cases for the
no_std
+x86_64
+performance critical
combo aren't the most plentiful. I'm not sure how many people would actually have a need for this, besides "it's nice to have". Boils down to cost-benefit analysis.
So who would actually benefit from this? Is there a concrete use case that motivated this PR in the first place? In my view, if this is just "nice to have," then I'm not sure it's worth doing. Users will get the SSE2 implementation at least, which is also quite fast.
Yeah, this is about the kind of reply I had expected… The question of "Is it worth it?" and "Can you provide an actual use case?" are definitely the two things I was most unsure about when I posted this PR.
The thing about performance is, well, for most cases it doesn't really matter that much. Even without considering no_std
, I'd be hard pressed to think of any real world situation where AVX is absolutely necessary (except for benchmarking, of course).
In my case, I was writing a no_std
project with "fast" being one of its goals. It spends most of its time in memchr
. But how fast is considered fast enough? Compared to the SSE implementation, I remembered I had measured about something like 10%~20% increase in throughput. But do I really need that extra throughput? Probably not. This was more about "the code is already there, why not try to make use of that" than any practical need that motivates it.
In the end, whether something is considered "worth doing" is quite subjective, and it's up to the maintainer to decide what sort of trade-offs they want for their project. I too have reservations about this PR, but can't be sure without at least trying right:joy:. Feel free to close if that's what you decided, I don't think I can make a convincing case of it. It's definitely understandable, I'd be hesitant to mess with something like CPUID just for this, too.
Edit: Oh by the way,
- Fall back to
#[cfg(target_feature)]
compile-time detection forno_std
.
Can we do this to keep the AVX option available at least? Most people probably still won't benefit from this, but it should be a minimal three-lines change.
Out of curiosity, what is the use case for a no_std
crate on x86?
Edit: Oh by the way,
- Fall back to
#[cfg(target_feature)]
compile-time detection forno_std
.Can we do this to keep the AVX option available at least? Most people probably still won't benefit from this, but it should be a minimal three-lines change.
Yeah I think I can get on board with this. Otherwise, I'll probably hold off on this PR.
Out of curiosity, what is the use case for a
no_std
crate on x86?
I may misunderstand (edit: yes), but I can think of a likely case and an unlikely (or rather uncommon) case:
- a library that wants to be usable by
no_std
and non-no_std
crates, on x86 or not; and - a kernel? (I don't know whether the latter would have some other reason to be unable to use this library.)
@8573 The former case isn't really applicable here. This library already gives you that. The context that is important here is actually running code without std on x86, as this PR is about making a faster version of memchr available in such scenarios.
I am bringing in #120 so hopefully that's good enough. Otherwise I don't think the use case is compelling enough here to justify maintaining my own CPUID checks. I'm not 100% opposed to it, but I would need a really compelling use case for it because it looks like a maintenance hazard to me.