memchr icon indicating copy to clipboard operation
memchr copied to clipboard

simd: support AVX2 detection without std

Open andylizi opened this issue 2 years ago • 5 comments

Summary

This PR implements AVX2 runtime feature detection that works in no_std using the cpuid instruction.

Motivation

memchr provides low-level primitives, and it's unfortunate that we cannot make use of the most performant implementation without bringing in libstd. And there's no good reason for it, either — unlike other architectures, detecting CPU features on x86 isn't OS dependent (see this comment).

The proper solution would be to wait until is_x86_feature_detected is available in libcore, but that probably won't happen any time soon. In the meantime, I'd like to propose doing it ourselves as a stopgap measure.

Drawbacks

  • This adds an amount of low-level complexity to the crate. Getting information from cpuid isn't the simplest thing and involves a lot of details to get right. However, compared to writing complex SIMD routines by hand, I believe it should be relatively manageable.

  • The maintenance burden should be neglectable, since we only care about one feature and the CPUID "protocol" can never change, no reason to touch it in the future.

Rationale and alternatives

  • Not doing this. The real life use cases for the no_std + x86_64 + performance critical combo aren't the most plentiful. I'm not sure how many people would actually have a need for this, besides "it's nice to have". Boils down to cost-benefit analysis.

  • Use another crate to do the detection for us.

    • This brings in extra dependencies.
    • raw-cpuid supports everything related to cpuid, and it's quite large as a result. Is the extra compile time worth it?
    • cpufeatures is much more lightweight, but it's detection logic is actually faulty — same issue as gcc#85100, glibc#13007, etc.
  • Fall back to #[cfg(target_feature)] compile-time detection for no_std.

    • This needs to be enabled manually for the end users, but requires the least amount of effort to implement.
    • The Steam Hardware Survey shows that 87% of Steam users have AVX2 support. While this survery is biased for higher-end gaming PCs, I think it's still fair to say that AVX2 is fairly common nowadays. It makes sense for people to enable it unconditionally, depending on their target demographics.

Prior art

  • getrandom uses cpuid to detect rdrand.
  • rdrand also do this when the std feature is disabled.
  • Most crates under RustCrypto are no_std, which use the previously mentioned cpufeatures crate for feature detection.
  • simdutf8 does target_feature compile-time detection in no_std.

andylizi avatar Apr 07 '22 16:04 andylizi

Thanks for the PR and the very thorough analysis! I'm definitely a bit hesitant to commit to something like this, in part because it's hard to know whether the code written for the CPUID check is correct or not without spending the time to go and understand everything myself. (Which I don't have time for right now.)

The proper solution would be to wait until is_x86_feature_detected is available in libcore, but that probably won't happen any time soon. In the meantime, I'd like to propose doing it ourselves as a stopgap measure.

Could you say what this is a stop gap for? In particular, you later say:

  • The real life use cases for the no_std + x86_64 + performance critical combo aren't the most plentiful. I'm not sure how many people would actually have a need for this, besides "it's nice to have". Boils down to cost-benefit analysis.

So who would actually benefit from this? Is there a concrete use case that motivated this PR in the first place? In my view, if this is just "nice to have," then I'm not sure it's worth doing. Users will get the SSE2 implementation at least, which is also quite fast.

BurntSushi avatar Apr 28 '22 12:04 BurntSushi

Yeah, this is about the kind of reply I had expected… The question of "Is it worth it?" and "Can you provide an actual use case?" are definitely the two things I was most unsure about when I posted this PR.

The thing about performance is, well, for most cases it doesn't really matter that much. Even without considering no_std, I'd be hard pressed to think of any real world situation where AVX is absolutely necessary (except for benchmarking, of course).

In my case, I was writing a no_std project with "fast" being one of its goals. It spends most of its time in memchr. But how fast is considered fast enough? Compared to the SSE implementation, I remembered I had measured about something like 10%~20% increase in throughput. But do I really need that extra throughput? Probably not. This was more about "the code is already there, why not try to make use of that" than any practical need that motivates it.

In the end, whether something is considered "worth doing" is quite subjective, and it's up to the maintainer to decide what sort of trade-offs they want for their project. I too have reservations about this PR, but can't be sure without at least trying right:joy:. Feel free to close if that's what you decided, I don't think I can make a convincing case of it. It's definitely understandable, I'd be hesitant to mess with something like CPUID just for this, too.


Edit: Oh by the way,

  • Fall back to #[cfg(target_feature)] compile-time detection for no_std.

Can we do this to keep the AVX option available at least? Most people probably still won't benefit from this, but it should be a minimal three-lines change.

andylizi avatar Apr 28 '22 15:04 andylizi

Out of curiosity, what is the use case for a no_std crate on x86?

Edit: Oh by the way,

  • Fall back to #[cfg(target_feature)] compile-time detection for no_std.

Can we do this to keep the AVX option available at least? Most people probably still won't benefit from this, but it should be a minimal three-lines change.

Yeah I think I can get on board with this. Otherwise, I'll probably hold off on this PR.

BurntSushi avatar Apr 28 '22 15:04 BurntSushi

Out of curiosity, what is the use case for a no_std crate on x86?

I may misunderstand (edit: yes), but I can think of a likely case and an unlikely (or rather uncommon) case:

  • a library that wants to be usable by no_std and non-no_std crates, on x86 or not; and
  • a kernel? (I don't know whether the latter would have some other reason to be unable to use this library.)

8573 avatar Jun 02 '22 08:06 8573

@8573 The former case isn't really applicable here. This library already gives you that. The context that is important here is actually running code without std on x86, as this PR is about making a faster version of memchr available in such scenarios.

BurntSushi avatar Jun 02 '22 10:06 BurntSushi

I am bringing in #120 so hopefully that's good enough. Otherwise I don't think the use case is compelling enough here to justify maintaining my own CPUID checks. I'm not 100% opposed to it, but I would need a really compelling use case for it because it looks like a maintenance hazard to me.

BurntSushi avatar Jul 11 '23 03:07 BurntSushi