highway icon indicating copy to clipboard operation
highway copied to clipboard

integer division/rdiv, modulo, atan2

Open kfjahnke opened this issue 2 years ago • 19 comments

Is there code to do integer division/remainder division and/or modulo? And I'd also like to see atan2!

kfjahnke avatar Apr 11 '22 08:04 kfjahnke

We haven't needed div/mod yet. I'm curious as to what the use case is?

For atan2, my colleague who implemented the math library is currently out of office, but it's pretty straightforward to implement using the existing atan (https://en.wikipedia.org/wiki/Atan2#Definition_and_computation). Would welcome a patch if you want to give that a go?

jan-wassenberg avatar Apr 11 '22 08:04 jan-wassenberg

We haven't needed div/mod yet. I'm curious as to what the use case is?

Generic programming. If I don't have Div defined for all types, I need to special-case and route floating point code to use Div, and integer code to use e.g. a loop over the vector lanes, which is cumbersome. And I think we needn't really discuss the usefulness of modulo? It might be enough to have rdiv and use the remainder for modulo. I'm just in the process of building up a generic type for chunks and rolling out the operator functions with macros, and the special-casing for ops which don't cover all fundamental types is annoying. BTW - the chunk type is coming along nicely, and it looks like using the memory-backed container approach will be the most efficient. For now I route the code to loop over the backing memory for integer division and modulo, but that is slow. Proper vector code would be much appreciated!

Would welcome a patch if you want to give that a go

Okay, I'll see what I can do. Do you think it would be okay (licensewise) to adapt Vc's implementation? I'd rather not reinvent the wheel.

kfjahnke avatar Apr 11 '22 09:04 kfjahnke

Actually I haven't seen modulo used in SIMD code, either :) It seems an awkward fit for a SIMD wrapper because no instruction set that I know of supports it. It's probably going to be very slow no matter what we do?

Makes sense. I'm not a lawyer, but Vc is BSD licensed and the code would basically end up rewritten because of the differing underlying vector API, so I'd think that's new code even if inspired by the old and thus we'd seemingly be OK license-wise.

jan-wassenberg avatar Apr 11 '22 12:04 jan-wassenberg

Am 11.04.22 um 14:00 schrieb Jan Wassenberg:

Actually I haven't seen modulo used in SIMD code, either :) It seems an awkward fit for a SIMD wrapper because no instruction set that I know of supports it. It's probably going to be very slow no matter what we do?

It's not so hard - for A % B you just do a (truncating) integer division like _mm256_div_epi32 to get the quotient Q (which I'd like to see as the result for Div with integer vectors), then the module is M = A - B * Q, so that M == A % B - maybe with a bit of finessing for signed integers. rdiv - if it were to be had - would be ( Q , M ) = rdiv ( A , B ), with A == Q * B + M. All of this can be done with very few vector instructions; the integer division will probably be the slowest part. I can't see why it should be slow. Of course there's likely no direct equivalent in assembler, it would need to be a function, just like atan2 has to be implemented as a function.

kfjahnke avatar Apr 11 '22 16:04 kfjahnke

I agree computing modulo is straightforward if we can divide.

_mm256_div_epi32

Unfortunately that's an SVML function instead of an actual instruction. Probably the fastest implementation is to promote int32 to double and use divpd, but that doesn't work for int64.

I don't mind adding ops, but it's important to have a use-case that cares about performance first. In this case it seems we could first have a wrapper function at user level?

jan-wassenberg avatar Apr 12 '22 07:04 jan-wassenberg

Okay, fine, I'll roll it out to scalar for now. It's probably not used much. If it turns out to be a bottleneck, a faster implementation can be added later. I'd rather have a replacement simd object type ready ASAP to get a first impression of how my code works with a highway backend, the tweaking for speed can be done later.

kfjahnke avatar Apr 12 '22 08:04 kfjahnke

Sounds good :)

jan-wassenberg avatar Apr 12 '22 13:04 jan-wassenberg

With my new SIMD template class I can now route the divisions and modulo to goading where needed, and I've tentatively used Wikipedia to code a vectorized atan2 version inside my class. I also had a go at using Vc's code (which ports well, because the interface of my class is the same as Vc::SimdArray) which seemed even faster. Vc uses isnan, copysign and isfinite, which might be nice to have in highway, but I could not find the likes.

kfjahnke avatar Apr 19 '22 06:04 kfjahnke

We do have CopySign, I can add IsNaN and IsFinite soon.

jan-wassenberg avatar Apr 19 '22 08:04 jan-wassenberg

We do have CopySign, I can add IsNaN and IsFinite soon.

Okay, with that I can probably port from Vc's atan2 without much ado! Let me know when the functions are there.

kfjahnke avatar Apr 19 '22 09:04 kfjahnke

@kfjahnke it's in :)

jan-wassenberg avatar Apr 21 '22 11:04 jan-wassenberg

I have ported Vc's atan2 code to use highway internally, but before publishing it, I'd like their advice on copyright/license matters, because the code structure is still Vc's, even though the functionality is implemented with highway code. So I've opened an issue with Vc. I have done some tests with this new implementation, and coding at the hwy Vec level seems to produce a noticeable but small speedup, likely because the rather large routine is too much for the compiler to 'grasp'.

kfjahnke avatar Apr 23 '22 08:04 kfjahnke

Nice that you've reached out to them for clarification. Interesting, glad you're seeing a good result now that we've added the Is* :)

jan-wassenberg avatar Apr 25 '22 06:04 jan-wassenberg

The atan2 implementation I wrote at the hwy::Vec level uses ordinary highway function calls, including the new ones. Before I come out with the code I want to give the Vc people some time to say what they think about me porting it. It's not such a big deal - basically what's needed is an efficient way of dealing with a few corner cases and figuring out the right quadrant, and I suppose if your math guy hadn't been on holiday you'd have had the hwy function within half an hour without me pussyfooting around the issue. But with my hwy::Atan2 implementation I have now just about broken even: my tests with the highway-based code come out roughly as fast as the Vc version! It was the missing bit which made all the difference. An aside: I scanned M. Kretz' implementation of std::simd (and the upcoming version in GNU's C++ library which is very similar) and found that many math functions seem to be 'rolled out' over the lanes as well, in contrast to Vc's hand-coded math functions. This may explain why my code variants using std::simd are sometimes so slow.

kfjahnke avatar Apr 25 '22 08:04 kfjahnke

Ah, you mean like this? https://github.com/VcDevel/std-simd/blob/master/experimental/bits/simd_math.h#L975 Interesting, did not know some of the implementation is a scalar loop, but I suppose it's understandable if those are implemented mainly for completeness rather than speed.

jan-wassenberg avatar Apr 25 '22 08:04 jan-wassenberg

I find the code hard to fathom, but to me it looks like the actual implementation is in bits/simd_builtin.h, and amounts to _GLIBCXX_SIMD_MATH_FALLBACK(atan2), which basicly uses a loop over the zipped vectors. If the vectorized implementation of atan2 is somewhere there, I could not find it (as opposed to Vc which clearly has it). Of course there might be a macro somewhere taking 'atan2' as an argument and doing something interesting with it that you won't find by grepping the code. Maybe I'm missing something. I think it's indeed implemented for completeness, as you suspect. It works, but it's slow.

kfjahnke avatar Apr 25 '22 11:04 kfjahnke

Indeed very hard to read. Looks like you are right, I don't see another implementation of atan2 either. But: there is a cosSeries which looks like a (vectorized) polynomial even though simd_builtin.h also has _GLIBCXX_SIMD_MATH_FALLBACK(cos). Strange.

jan-wassenberg avatar Apr 25 '22 15:04 jan-wassenberg

I did a 'proof by sabotage' and commented out the invocation _GLIBCXX_SIMD_MATH_FALLBACK(atan2) in simd_builtin.h. This prevents my code (which invokes the atan2 function) from compiling, revealing the call chain:

/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/experimental/bits/simd_math.h:446:1: error: no member named '_S_atan2' in 'std::experimental::parallelism_v2::_SimdImplX86<std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> >'
_GLIBCXX_SIMD_MATH_CALL2_(atan2, _Tp)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/experimental/bits/simd_math.h:135:23: note: expanded from macro '_GLIBCXX_SIMD_MATH_CALL2_'
            _Abi::_SimdImpl::_S_##__name(__data(__x), _Arg2::_S_data(__y))};   \
            ~~~~~~~~~~~~~~~~~^
<scratch space>:724:1: note: expanded from here
_S_atan2
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/experimental/bits/simd_fixed_size.h:1621:39: note: in instantiation of function template specialization 'std::experimental::parallelism_v2::atan2<float, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32>, std::experimental::parallelism_v2::_Extra_argument_type<float, float, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> > >' requested here
    _GLIBCXX_SIMD_APPLY_ON_TUPLE(_Tp, atan2)
                                      ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/experimental/bits/simd_math.h:446:1: note: in instantiation of function template specialization 'std::experimental::parallelism_v2::_SimdImplFixedSize<16>::_S_atan2<float, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32>, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> , std::experimental::parallelism_v2::_SimdTuple<float, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32>, std::experimental::parallelism_v2::simd_abi::_VecBuiltin<32> > >' requested here
_GLIBCXX_SIMD_MATH_CALL2_(atan2, _Tp)
^
/usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/experimental/bits/simd_math.h:135:23: note: expanded from macro '_GLIBCXX_SIMD_MATH_CALL2_'
            _Abi::_SimdImpl::_S_##__name(__data(__x), _Arg2::_S_data(__y))};   \
                             ^
<scratch space>:724:1: note: expanded from here
_S_atan2
^
/home/kfj/pv/vspline/std_simd_type.h:397:23: note: in instantiation of function template specialization 'std::experimental::parallelism_v2::atan2<float, std::experimental::parallelism_v2::simd_abi::_Fixed<16>, std::experimental::parallelism_v2::_Extra_argument_type<float, float, std::experimental::parallelism_v2::simd_abi::_Fixed<16> >, std::experimental::parallelism_v2::simd<float, std::experimental::parallelism_v2::simd_abi::_Fixed<16> > >' requested here
  BROADCAST_STD_FUNC2(atan2)

The sine and cosine are indeed implemented in vectorized form. The fallback exists for everything, but if there's something 'better' it's pushed to the background.

kfjahnke avatar Apr 25 '22 15:04 kfjahnke

Oh, OK, that makes sense :)

jan-wassenberg avatar Apr 25 '22 15:04 jan-wassenberg

I'm updating the wishlist (now that we have one) with Div/Rem. @kfjahnke did you want to send a pull request for your atan2 implementation?

jan-wassenberg avatar May 17 '23 13:05 jan-wassenberg

did you want to send a pull request for your atan2 implementation?

The problem with this implementation is that it's not my implementation: it's taken from Vc and 'bent' to highway. This makes the licensing complex - if you look into the code, e.g. here, you can see that there are three licenses involved: the one of Vc's initial implementation, highway's license for some infrastructure code I used to integrate the code into highway, and my own license for the derivative work. On the other hand, the implementation itself is nothing out of the ordinary, so the whole thing may not be sufficiently original to even need such elaborate licensing - it's really just a matter of getting the quadrants right and to express the process in SIMD code with masks rather than the scalar version's conditionals and to handle corner cases so that the result conforms to standards. But I am not a copyright lawyer. As far as Vc's license goes, I asked and they insisted on attribution, as per Vc' license. As far as highway is concerned, you'd not have a problem using your own code ;) When it comes to my own effort, I haven't signed the contributor agreement with google, and I'm not sure if I want to. But my license is liberal (MIT), and so is Vc's. If you can live with the licensing as-is you can simply grab the file and put it into the highway code base - the license expressly and deliberately gives you the right to do so. If you'd prefer me to issue a pull request to that effect I suppose could do that, leaving the file as it is.

I'm updating the wishlist (now that we have one)

Hey, great, where is it?

kfjahnke avatar May 18 '23 06:05 kfjahnke

Yes, I understand. Based on Vc's comments I think we can just add their BSD header plus yours if you want before the Atan2 function and it would be fine.

We'd welcome you to send a pull request because it's your code; let's just copy the (Call)Atan2 functions into math-inl.h, along with the license header comment block. And bonus points for also removing it from the wishlist, see below.

But this will require you to sign the CLA which basically just gives us permission to use your code. If you don't want to do that, then yes, I can do the pull request as well, though that doesn't credit you as the author as clearly (e.g. in Git history).

Hey, great, where is it?

https://github.com/google/highway/blob/master/g3doc/op_wishlist.md

jan-wassenberg avatar May 19 '23 10:05 jan-wassenberg

But this will require you to sign the CLA which basically just gives us permission to use your code. If you don't want to do that, then yes, I can do the pull request as well, though that doesn't credit you as the author as clearly (e.g. in Git history).

Right now I'd rather not sign the CLA. The licenses in the file do give you permission to use the code, and you probably have better resources than I to help you determine how the atan2 implementation can be integrated into your code base without stepping on anyone's toes. As credits to myself go, having my copyright and permission notice for the part I did in the source code is enough credit for me - as stated in my license.

I did the coding work because I could not run my program (lux) satisfactorily with my highway back-end without a fast atan2 function. If the code moves to your code base this is a relief - I would have preferred your side to have provided it straight away, but since I had plenty of other reasons to use highway (like, ARM support, AVX512...) I went ahead and ported the function from the source I knew, and to good effect. I hope it will be a welcome addition to your code base and help other users!

As discussed I think you can drop my quotation of the original Vc code.

Can you propose something which could serve as a sentinel for the function? So that I can conditionally include my own source depending on whether it's already available or not on a target system? (like #ifdef HAVE_HWY_ATAN2)

kfjahnke avatar May 23 '23 07:05 kfjahnke

I understand :) hm, I've copied this in, implemented a test according to https://en.cppreference.com/w/cpp/numeric/math/atan2 and it looks like the code isn't yet passing.

I'll send an implementation out for review which does pass, and I believe is not derived from your nor Vc code, so we can skip the license headers.

Would you like to notify Vc that their code may potentially have a bug? I believe it's that atan2(y=4, x= -0) != pi / 2.

Can you propose something which could serve as a sentinel for the function? So that I can conditionally include my own source depending on whether it's already available or not on a target system? (like #ifdef HAVE_HWY_ATAN2)

Ah, makes sense. Sure, will define HWY_HAVE_ATAN2 to 1.

jan-wassenberg avatar May 23 '23 18:05 jan-wassenberg

Would you like to notify Vc that their code may potentially have a bug? I believe it's that atan2(y=4, x= -0) != pi / 2.

No, I checked. The fault must be in my port, the Vc implementation produces the correct result. I tried to use your new implementation in my highway back-end, but it's not yet online. Luckily, it's a corner case.

Thanks for the sentinel, thanks for finding the bug and thanks for your atan2 implementation!

kfjahnke avatar May 24 '23 07:05 kfjahnke

Ah, ok. Thanks for verifying the result. We'll have the code merged soon. You're welcome! With this many special cases, it's difficult to cover them all :)

jan-wassenberg avatar May 24 '23 08:05 jan-wassenberg

This sentinel does not work for me:

#ifndef HWY_HAVE_ATAN2
#define HWY_HAVE_ATAN2 1
#endif

the '#endif' should go after the actual code, so I can #define the sentinel myself to avoid your code if I want to use mine. As it is, I can't include math-inl.h and use my own implementation, because that results in an ambiguity.

A first test using your atan2 implementation in lux produces wrong results: I get half the 360 degree image, mirrored. I'll investigate...

kfjahnke avatar May 24 '23 09:05 kfjahnke

A first test using your atan2 implementation in lux produces wrong results: I get half the 360 degree image, mirrored. I'll investigate...

Oops.. have to eat my words, I think it's my mistake.

kfjahnke avatar May 24 '23 09:05 kfjahnke

Ah, I had assumed the idea was that we define a flag so that your code knows not to define your own Atan2. Would that work for you?

jan-wassenberg avatar May 24 '23 09:05 jan-wassenberg

I think it should work both ways. Just put the #endif after the function, then all is well and I can 'blot it out' if I want to use mine instead.

kfjahnke avatar May 24 '23 09:05 kfjahnke