Halide icon indicating copy to clipboard operation
Halide copied to clipboard

host target detection shouldn't set features that are implied by other features

Open abadams opened this issue 2 years ago • 10 comments

More generally, we need to be smarter about features that imply other features.

If you manually set the target, e.g. x86-64-linux-avx2, then the x86 backend completes this target by setting all implied features to produce x86-64-linux-sse41-avx-f16c-fma-avx2 and uses that to make instruction selection decisions. So you don't need to manually set implied features in a build(*).

Host target detection on such a machine will return the latter completed target rather than the former. This is getting increasingly ridiculous. Sapphire rapids avx512 implies everything in cannonlake avx512, skylake avx512, and all the way down, so on a sapphire rapids machine, host target detection returns:

x86-64-linux-sse41-avx-f16c-fma-avx2-avx512-avx512_skylake-avx512_cannonlake-avx512_sapphire_rapids

zen4's avx512 support is a superset of cannonlake and a subset of sapphire rapids, so once that's merged we'd generate:

x86-64-linux-sse41-avx-f16c-fma-avx2-avx512-avx512_skylake-avx512_cannonlake-avx512_zen4-avx512_sapphire_rapids

This discourages us from being too granular with what generations of thing we support, which is silly. Target strings shouldn't be size O(n) after n generations. If we wanted to be very precise on arm, we'd have to have target strings like:

arm-64-linux-aarchv81-aarchv82-aarchv84-aarchv85 ....

I think we should stop manually setting implied features, and rely on the backends to do the feature bit completion on a local copy of the target as necessary if it's convenient for them to make instruction selection decisions. The only thing we'd really need to change to enact this is to stop setting implied features when we detect the host target. The rest is a matter of usage rather than a code change.

A gotcha I anticipate is if a user schedule switches based on something like get_host_target().has_feature(Target::AVX). We might address this with a new target accessor (supports_feature?) that returns true for implied features too. Or maybe has_feature should return true if the feature is implied. Target.cpp would have to keep a list of feature implications (e.g. avx2 -> avx), and that would also let us centralize target completion for the backends.

Yet another option is to set implied features but not print them when converting a target to a string. I think it's probably better if the set flags in the struct mirrors the string exactly though.

(*) though making that change would break schedules that do things like has_feature(Target::AVX) and want that to return true for avx2, so for existing codebases (e.g. google3) we may not want to change the target strings in use.

abadams avatar Sep 06 '23 18:09 abadams

Here are my current thoughts. There are three main goals that should guide the implementation:

  1. Two targets for which the compiler must behave identically should have the same in-memory and string representations.
  2. After n generations of an architecture, the target string should be size O(1), so that humans can write and read them.
  3. target.has_feature(Foo) should return true for any target for which the compiler must act as if Foo is set.

And then I should precisely define what I mean by an implied feature:

Feature A implies feature B if the compiler must behave identically when A is set and when A and B are both set.

For example, avx implies sse41. It's impossible to get LLVM to use avx instructions without also using sse41 instructions. It just won't do it. avx without sse41 is a meaningless request, because the avx spec includes all the sse41 instructions. To give another even example for a hypothetical pair of target features, armv8.4 implies armv8.3. It is nonsensical to ask for the compiler to make use of armv8.4 instructions without also using armv8.3 instructions. This is what I mean by an implied feature.

To satisfy goals 1 and 3, I think we should normalize targets on parsing them by setting all features implied by any other feature. To satisfy 2, I think that when printing targets we should unset all implied features so that they stay O(1) in size. There would be code in Target.cpp that has a list of all the implied features so that it can do this. We would remove the target completion code from the backends.

abadams avatar Sep 07 '23 15:09 abadams

Toward the last paragraph, I'd argue for a canonical string representation, which is longwinded and ordered, and a pretty printed form. If the architecture specification guarantees that feature_newer implies feature_older then using that implication in our implementation is fine, but we should only do so if that is the case. Unfortunately given how x86 is going, I'm not sure this really is the case often enough to prevent expansion of feature list strings. Also, shortening means one has to know the graph of instruction set dependencies to reason about what the details really are. Hence my desire to make printing the fully expanded form easy and to use it in any sort of build system tags rather than the shortened form.

Hopefully these framing questions will also be helpful:

  • Is there a distinction between "the compiler is allowed to use instruction set feature X" and "this generated code requires instruction set feature X to run?" Specifically, if we default the set of features to everything a recent chip supports, and some of those features are very unlikely to be generated in most output, requiring those obscure features to run the code is wrong. Do we care?

    (My first cut answer is: It would be very useful to have the runtime gating reflect what the code actually requires, but doing so reliably without the user specifying it as the compilation target likely requires an exhaustive check of the generated code.)

  • How important is it to be able to specify arbitrary subsets of a specific chips instruction set support for performance reasons? And relatedly, does the compiler support such fine grained specification? Specifically, assuming use of the longest vector width supported by modern implementations is problematic for performance and power usage. (Thus the distinction in cpu_features between "has AVX512" and "has fast AVX512.") Assuming use of specialized systolic units is problematic if the code does not have sufficient execution on those units. Etc. This establishes a reason to control feature bits other than whether or not code will run on a given implementation.

  • What do we guarantee for compiled code on future architectures? We likely desire that there is no performance regression if the hardware supports the features the code uses. This has implications for how we detect features at runtime. Specifically casing on the implementation identifier can fail to detect valid features on future architectures without an update to the runtime code. Grouping features by implementation either requires that the architecture specifies some features must be implemented if others are, or assumes vendors will not take advantage of the opportunity to implement some but not all of those features in the future. (And this must work across all vendors of a given architecture.)

  • How much do we care about runtime complexity of feature detection? Writing a runtime library routine to enumerate all the features is likely more complex than using code generation to emit a binary expression for "Will this kernel run on the current hardware?"

(As an aside here, a lot of this is a side effect of failures in computer architecture. It's a bit much to expect something simple and logical when the forcing function is effectively adversarial due to fundamental problems in the ecosystem being modeled. RISC V is also a case where there are a lot of features that can be controlled separately, largely resulting from a lot of vendors all trying to do different things and effectively wanting slightly different architectures. The issue seems to be inherent to any instruction set that gets used and implemented by a lot of people over a long period of time.)

zvookin avatar Sep 08 '23 21:09 zvookin

One more thing: we probably ought to look at breaking out instruction set feature flags from other types of features and possibly grouping the ISA ones by architecture.

zvookin avatar Sep 08 '23 21:09 zvookin

And another thing is tuning vs. what instructions are allowed to be used. The set of names for things to tune for is pretty large and determined by the underlying backend. E.g. LLVM. I still think there's a lot of value in being able to pass through flags directly to the backend. Backing out the processor tuning enum and just having a way to pass flags through would be an improvement.

zvookin avatar Sep 08 '23 21:09 zvookin

I am indeed only talking about cases where the architecture specification guarantees it. x86 and riscv are not going to be single clean hierarchies. Because I'm talking about strict implication, there's no difference to machines whether the short or long form is used. They mean the same thing and will have the same in-memory representation. There's only a difference to human readers and writers.

Personally, I don't want to ever be reading or writing the target string "arm-64-linux-aarchv81-aarchv82-aarchv83-aarchv84-...". People can disagree and write whatever they want and it will get normalized, so the confusion can only occur on the reading side when humans need to parse things output by the system.

I originally wrote a longer argument here for why the short name will always be better, but it would be better to punt for now and look at real use cases in a PR instead of making general arguments. E.g. in places where the target leaks into a path name I think we're always going to want to use short names due to path length limits on windows and terminal wrapping issues.

abadams avatar Sep 08 '23 22:09 abadams

"... there's no difference to machines whether the short or long form is used" this is not correct if the the compiler substrate allows finer grained control and there is a performance reason to disable something even if it is guaranteed to be present.

Note that we chose to use a short form for AVX512 where there is no architectural guarantee. I really don't see a way to do this other than to separate the usable short form and a full specification canonicalized form. The canonicalized form can be compacted into fewer characters in a number of ways with a loss of readability, though it can also be split into pathname components if that is the big issue. Note there are a lot of non-ISA feature flags as well.

zvookin avatar Sep 13 '23 00:09 zvookin

I'm making it true by definition. If you can think of a case where it's meaningful to have feature A enabled and feature B disabled, then A does not imply B. So an equivalent definition of what I mean by "implied feature" for the purposes of this issue is that there's no difference to machines whether or not the short or long form is used.

So armv8.4 implies armv8.3, and avx2 implies avx, because in each case the former spec subsumes the latter, and it's not a meaningful request to ask a compiler to use armv8.4 instructions but not armv8.3 instructions.

I'm having a hard time of thinking of an example we had that looks like it might be an implied feature but isn't. If we had a feature that meant "target skylake", then it would not imply avx512, because that's a separate decision that can be made for power/perf reasons as you say. The feature we do have that mentions skylake by name is the avx512_skylake feature, which means "enable the set of avx512 extensions present on skylake", so it does imply avx512. I guess in a target string you can say "tune-znver4", which means tune for zen4, but that wouldn't imply any features at all, because I you might want to tune our baseline x86-64 sse2-only for zen4 for some reason. It's a silly request, but it's a meaningful one, so it's not an implied feature.

abadams avatar Sep 13 '23 00:09 abadams

Something a bit weird I thought I should raise in this thread:

If A implies B, should it be possible to construct a Target object that has feature A set but not feature B set? I.e. are Target objects with A set and B unset considered malformed? I'm inclined to say yes. If they're malformed, then set_feature and with_feature would have to also set implied features.

If these targets are permitted, and we only do canonicalization on string conversions, then we have this unfortunate situation:

Target{"x86-64-linux-avx2"}.has_feature(Target::AVX); // true
Target{"x86-64-linux"}.with_feature(Target::AVX2).has_feature(Target::AVX); // false. Weird instruction selection may result

To keep the string and programmatic Target construction methods identical, if you set a feature we must also set all implied features (turning on avx2 turns on avx), and if you unset a feature you also unset any features that would imply the feature being unset (turning off avx turns off avx2).

However, then you get the annoying property that turning on and then off a feature programmatically leaves all the implied features turned on:

Target{"x86-64-linux"}.with_feature(Target::AVX2).without_feature(Target::AVX2).has_feature(Target::AVX); // true

I'm not sure how disturbed I am by that. It seems less bad than being able to construct meaningless targets.

abadams avatar Sep 27 '23 19:09 abadams

This is more of a practical comment towards implementation, not really anything new. Just my few cents on what I think would be nice to use.

I'm thinking that if you make an mapping of Feature -> Feature[] representing all implied features given one other feature (in a non-recursive fashion to keep the lists short*), it should be fairly straightforward to:

  • use the mapping to enable the implied features (when parsing target strings, or when calling with_feature()), and
  • use the same mapping to strip away all the implied features again when generating the target string.

I think that's trivial to implement, and achieves the 3 goals. Additionally, it is very informative and intuitive to examine such a data structure for people familiarizing themselves with new architectures (e.g., today I have never worked with ARM and looking at the mappings can be super handy).

(*) What I mean with non-recursive is that you'd have mappings like below, and the code traverses the mappings recursively until all implied features are set recursively:

AVX2 -> AVX
AVX -> [SSE41, FMA, F16C]
SSE41 -> MMX
ARMv8.3 -> ARMv8.2
ARMv8.2 -> ARMv8.1

(I'm making up stuff, I don't know yet which feature exactly implies FMA and F16C.)


Regarding:

However, then you get the annoying property that turning on and then off a feature programmatically leaves all the implied features turned on:

Target{"x86-64-linux"}.with_feature(Target::AVX2).without_feature(Target::AVX2).has_feature(Target::AVX); // true

Would this ever happen in practice? Personally, I have never used .with_feature(). I'm always using AOT Generators. Maybe I'm missing useful cases for programmatically setting features?

mcourteaux avatar Oct 13 '23 11:10 mcourteaux

I'm not sure how disturbed I am by that. It seems less bad than being able to construct meaningless targets.

It may be unreasonably hard to make it impossible to prevent "bad" Targets from being constructed, but we should certainly do what we can to detect and forbid such Targets where it's easy to do so

steven-johnson avatar Oct 15 '23 01:10 steven-johnson