highway
highway copied to clipboard
IndicesFromVec, SetTableIndices and TableLookupLanes {u,i}{16} support
IndicesFromVec, SetTableIndices and TableLookupLanes current only support {u,i}{32,64} for integers. I'm trying to figure out what is required to support for {u,i}{16} on x86. I'm bacoming familiar with the user side for highway but haven't implemented anything inside of highway itself yet so it might be a bit out of my depth until I become familiar with the highway template system.
It seems it should be possible to support 16-bit table lookups on x86 with:
-
PSHUFB
(SSSE3) -
VPSHUFB
(AVX,AVX2) -
VPERMW
(AVX2, AVX-512)
Although one would need to translate tables for use with PSHUFB
. VPERMW
already uses 16-bit tables. is my assumption correct that the difficulty is translating word indices to bytes indices so that they can be used with PSHUFB
on 128-bit x86? e.g.
- {
0000 0001 0002 0003 0004 0005 0006 0007
} to {0001 0203 0405 0607 0809 0A0B 0C0D 0E0F
}
For IndicesFromVec
one needs code to replicate odd-to-even lanes and add 1 to odd lanes. is there anything required for SetTableIndices? can any of this be done as constant expressions so that transformed constants are stored in the text?
Tangentially :- for other architectures, could the word to byte index translation be implemented generically? I don't know enough about table lookups on the other vector architectures but I know they all have TableLookupBytes
and that approach might work for some of them as an intermediate fix? although that won't work for AVX-512 where TableLookupBytes
indices are 4-bit quadrant local. so I think AVX2 and AVX-512 really need to use VPERMW
to get indices that can access all lanes.
noting that I made a couple of factual mistakes about shuffle and permute on x86. VPERMW
does not exist for AVX2 or AVX512B rather it is part of AVX512VL. alsoVPSHUFB
would also have the 4-bit lane limitation for 256-bit vectors...
Hi @michaeljclark, great to hear you're interested in looking into this. It would indeed be nice to extend those to 16-bit lanes and I'm happy to help. Have you seen https://github.com/google/highway/blob/master/g3doc/impl_details.md ? That has some pointers on how to add new ops (in this case, the functions already exist and it's more a case of adding extra overloads for the 16-bit types).
Yes, it seems very doable to adapt 16-bit indices to 128-bit PSHUFB as you suggest. SetTableIndices is the same as IndicesFromVec except that it loads from memory.
The other platforms actually go to extra effort to have their TableLookupBytes behave like x86 (limited to "4 bits" as you say). On those platforms, we generally have TableLookupLanes which can already index arbitrary positions within a vector, so it's just a matter of allowing 16-bit types also.
The main effort will be for AVX2, because AVX3 already has VPERMW as you mentioned. So for AVX2, what we could do is do two TableLookupBytes (for upper and lower 128 bit pieces) and then IfThenElse between them using bit 4 of the byte index (which PSHUFB ignores). Or does anyone see a better way?
Hi @jan-wassenberg, thanks for your detailed reply. Yes, I'd like to help but I don't want to stop someone who might attempt this because in the meantime I found a workaround that I can implement in user code so I'll leave this here as a feature request but will give it a try in the background when I have time.
This came about because I was looking into demotions from 64-bit types using DemoteTo
for which {u,i}{64} support is missing. So I came up with a U32FromU64
truncation as a workaround that was similar in form to U8FromU32
by using BitCast, TableLookupLanes then LowerHalf, which I then used manually before calling DemoteTo to handle 64-bit integer demotions. It then occurred to me that I could also implement a U16FromU64
in one step using VPERMW on AVX-512 because its indices are able to access all lanes. That's when I hit this 16-bit lane limitation which prevented me from trying that so I moved on.
Thus truncating versus saturating demotions was another issue I hit. I am able to use truncating conversions because I have already performed a MaxOfLanes
pass to figure out the minimum size that I can truncate to (as I am using Highway to implement a compressed integer array where I reduce blocks to the minimum type width that holds the maximum block delta).
Has anyone discussed a more regular API for truncating demotions? Adding U32FromU64
would be extending the piecemeal approach set by U8FromU32
. Should there also be a TruncateTo
template?
So if you think it worthwhile I could also raise tickets for these two issues:
- DemoteTo {u,i}{64} support
- not sure there is an instruction for this so it may need less-than INT_MIN, greater-than INT_MAX, IfThenElse, ...
- alternatively one could use a logical equals-to comparison to see if the even dwords are all zeros or all ones
- TruncateTo {u,i}{64,32,16} support
- some formalism for
U8FromU32
to make a more complete set of truncating demotions - isn't this odd lanes for one level?
- some formalism for
I guess DemoteTo {u,i}{64} support is documented by its absence, but there is the issue of a formalism for Truncate. Now I think about it, I should be able to extract odd lanes of the BitCast of one type smaller using an existing API? ...
Also on the naming of reductions. I haven't found ReduceOdd and ReduceEven primitives (UnzipOdd or UnzipEven?). I see DupOdd and DupEven but they don't return a vector that is half the width. It occurred to me that they would be useful for TruncateTo using BitCast. I do not know if they exist but I don't know the name. Oh. ConcatOdd and ConcatEven? They don't reduce. So I guess an UnzipOdd or UnzipEven would be LowerHalf(ConcatOdd(v,Zero())) and LowerHalf(ConcatEven(v,Zero()))
Good point about U8FromU32 being a piecemeal approach. I agree this is not the best path and think that TruncateTo is a great idea. We can then deprecate U8FromU32, and implement it in terms of TruncateTo.
As to DemoteTo from u/i64: this seems reasonable. Our initial approach was to only provide conversions which are efficient everywhere. Since then we've shifted much more towards "useful operations that apps can't emulate more efficiently, and aren't painfully slow on any platform". Per that policy, supporting 64-bit totally makes sense.
Truncate: yes, that could be implemented as LowerHalf(ConcatEven(v, v) (even because lane 0 is the lowest). But that would be wasteful on SVE, which requires extra fixups just in case the HW vector length isn't a power of two. We could avoid that by adding TruncateTo which on SVE is indeed just return detail::ConcatEven(v, v);
(no fixup required because we only care about the lower half).
We'd welcome patches for either or both :)
Hi Jan,
Thanks for working on TruncateTo. I did something similar on the user side using TableLookupLanes. I will check out the changes and test TruncateTo. I myself worked on a CombineShiftRightLanes which is like CombineShiftRightBytes but at the lane level and not suffering from the quadrant issue. It is named similarly to TableLookup{Lanes,Bytes}. Otherwise, I think the funnel lane alignment shift is not exposed to the user in a primitive that does not suffer from the quadrant restriction. I make extensive use of it in my user code. It uses VALIGN{D,Q} on AVX-512 and I also implemented some emulation for SSE4/AVX2.
I don't have anything up to the standard of a pull request yet so I put my diffs into a gist so that you can see where I'm headed.
- CombineShiftRightLanes{32,64} - uses VALIGN{D,Q} or emulates using permutes so it is always full width
- PromoteTo{8,16} -> {64} - uses the sign extension right arithmetic shift technique or uses intrinsics on AVX-512
- TableLookupLanes{16} - uses intrinsic on AVX-512 but still needs conversion of indices for AVX/SSE
Here is the gist with my patch: https://gist.github.com/michaeljclark/d90480b95508082aac151b8e9909bb50
I need to write tests for the emulations. I'm not actually compiling for AVX2 but I wrote emulations for CombineShiftRightLanes. I have some things in user code that might be useful. I also implemented a Parallel Prefix Sum using consexpr_for emulation that works in C++17. When I get time I'll try and tidy it up. I only have access to SkylakeX so that's all I can test locally.
Thanks Again, Michael.
Oh. I should be explicit that anything in the gist I agree to be included under the highway license if you choose to use any of it before I get time to make a proper pull request with tests, etc. It probably needs to be split out into 3 different patches.
@michaeljclark sorry to reply super late, I missed this in my inbox. Your patch looks like a great start! Two minor comments: there is already a CombineShiftRightLanes, so we'll want to rename to something else. It's fine to choose any name because the across-block behavior is just something to be documented. Perhaps *AllLanes would be helpful for making that clear.
Also, I see >= HWY_AVX3
: lower is better, so you probably want <= HWY_AVX3
which matches HWY_AVX3
, HWY_AVX3_DL
, and anything newer.
Would be happy to advise on any remaining issues or how to implement this for other platforms 👍
No worries about the lag it is okay. Things take time for me too...
I think I used CombineShiftRight{Lanes,Bytes} to be consistent with TableLookup{Lanes,Bytes}. It seems the existing CombineShiftRightLanes was not fully implemented which is why I might have used it and I imagined the quadrant semantics would be consistent with TableLookup{Lanes,Bytes}. I am not sure whether the current choice is consistent if Lanes call has a quadrant limitation whereas TableLookupLanes doesn't.
I'll look into seeing what I did with >= HWY_AVX3
. When I tried with -mavx2
I think I got 128-bit vectors but I implemented the 256-bit variants anyhow even though they were not selected. There was an inverted case that I had to knock out when compiling with AVX2 or AVX3 given there are some 256-bit methods that are part of AVX-512. Ah yes, so if those 256-bit methods are defined in x86_512-inl.h I switch them off in x86_256-inl.h. I will double check it... It needs test cases...
about CombineShiftRight{Lanes,Bytes}. It should be consistent with TableLookup{Lanes,Bytes} no? does it mean there should also be TableLookupAllLanes? I guess this is for you to resolve how you want to make it clear to the user which method have the quadrant limitation. Given the precedent I would have assumed that a CombineShiftRightLanes doesn't. So in fact I have sort of implemented emulation which is that odd case with the < HWY_AVX3 because I define the 256-bit AVX3 ops in x86_512-inl.h
I may update (have updated) that gist. At some point I'll make a PR when I have tests and have resolved the naming.
Thanks @johnplatts for implementing this :D