highway icon indicating copy to clipboard operation
highway copied to clipboard

Integrated vector version of Xoshiro in contrib

Open DiamonDinoia opened this issue 1 year ago • 15 comments

Dear maintainers,

I implemented a vector version of xoshiro^1 in contrib.

This is a draft as I would like to know if there is interest in merging this.

If so I would like to have advise on how to structure the code and what type of refactor would be needed.

Thanks, Marco

DiamonDinoia avatar Jan 24 '24 17:01 DiamonDinoia

Nice, thanks for suggesting this. Generators seem widely useful. There's one structural issue (supporting non-constexpr-length vectors) which would be important to address, and some other style changes if you don't mind?

Hi, I do not mind. I will address your comments in the following days.

DiamonDinoia avatar Jan 24 '24 18:01 DiamonDinoia

Nice, thanks for suggesting this. Generators seem widely useful. There's one structural issue (supporting non-constexpr-length vectors) which would be important to address, and some other style changes if you don't mind?

Hi, I do not mind. I will address your comments in the following days.

Dear @jan-wassenberg,

I made some changes. Could you:

  1. Check that it makes sense
  2. It does not break tests as the previous did?
  3. I plan on optimizing the vector size further, I would like to have something that works first.

DiamonDinoia avatar Jan 29 '24 15:01 DiamonDinoia

Dear @jan-wassenberg,

I addressed the comments. Could you trigger the test pipeline?

Can you check that I understood your suggestion on class member correctly? I used std::array<std::array> as it should be stack allocated -> no pointer chasing even though is 2D. The limitation is that there is some storage waste when we require fewer than 8 streams. Alternatively it is possible to use std::array<AlingedArray> which has the following tradeoff: Pro:

  1. no space is wasted (padding might be required for alignment)
  2. Load instead of LoadU and Store instead of StoreU

Cons:

  1. Pointer chasing

Which one do you think is going to be faster in your opinion?

I prefer not to force 8 streams as the function next or unifom return a Vec<T> which can be dirtecly used inside another highway function, should I force 8 streams and return an array of 8 elements?

Depending on the answers above I plan to add:

  • A function to generate N (where N is a multiple of the Lanes) random numbers and return them in an alinged array. This is potentially faster as it reduces the number of LoadU and StoreU needed as once the state is loaded in register can be used and stored again after computing all the elements.

  • a BufferedXoshiro that generates N numbers at once and returns them one at the time. This can also be compatible with std::random but potentially faster as generates the numbers in one go and stores them until all consumed. I did not make this the default behavior as some platforms might have a limit on the memory available

  • Documenting the above.

DiamonDinoia avatar Jan 30 '24 18:01 DiamonDinoia

Dear @DiamonDinoia,

thanks for making the changes. Sorry that our builds are currently red, our internal build has different settings and unfortunately this leads to whack-a-mole, there is a larger update to thread_pool.h on the way.

I'm not sure I understand the difference between array and array. Pointer chasing or Gather are slow and best avoided when possible.

I prefer not to force 8 streams as the function next or unifom return a Vec which can be dirtecly used inside another highway function, should I force 8 streams and return an array of 8 elements?

If you want to return a vector, then I'd strongly suggest you adopt a class interface that does not store the internal state at all, but rather exposes them as two by-ref arguments that the caller passes in. This avoids any waste, and then callers can iterate until their desired number of streams has been satisfied by the number of vectors processed so far.

The Xorshift128Plus in hwy/contrib/sort/algo-inl.h is an example of this, but it only supports a single vector. If you want multiple streams, then the caller would Load[U] those vectors from an array of uint64_t. BTW feel free to move Xorshift128Plus to your new file if you like.

A function to generate N (where N is a multiple of the Lanes) random numbers and return them in an alinged array.

This sounds useful :)

This is potentially faster as it reduces the number of LoadU and StoreU needed as once the state is loaded in register can be used and stored again after computing all the elements.

For what it's worth, clang is pretty good at optimizing out redundant loads, and they are anyway quick, so perhaps not a huge concern.

a BufferedXoshiro that generates N numbers at once and returns them one at the time. This can also be compatible with std::random but potentially faster as generates the numbers in one go and stores them until all consumed. I did not make this the default behavior as some platforms might have a limit on the memory available

This also sounds useful. As is often the case, portability requires dynamic allocation of N*Lanes(d) elements, you can choose how many vectors N to buffer.

FYI I am currently preparing for an conference and traveling tomorrow, hence have a bit higher response time.

jan-wassenberg avatar Feb 02 '24 15:02 jan-wassenberg

Dear @jan-wassenberg

I updated the branch with the following:

  1. AlignedNDArray -> no pointer chasing, aligned storage no space is wasted as the number of streams is dynamic allocated (not sure how I missed its existence;
    1. Implemented and AlignedAllocator for convenience (just a wrapper on the existing methods),
  2. Implemented and AlignedVector for convenience;
  3. Provided methods that generate N numbers, if N is known a compile time is possible to use the template function that returns a std::array, otherwise an AlignedVector is returned; They are slightly faster as they avoid redundant load/store. If compilers can get rid of those themselves we can consider getting rid of them;
  4. Added support to multi-threading, the issue is proper seeding so that streams are correlated
  5. Added CachedXorshiro it stores N (where n is a power of 2) numbers generated in one go and it is compatible with std::random, any distribution from std::random can be used.

Could you review, trigger a test pipeline? Could you resolve addressed comments? I could do it but I think is not good practice. Let me know if you want me to.

Few comments below:

thanks for making the changes. Sorry that our builds are currently red, our internal build has different settings and unfortunately this leads to whack-a-mole, there is a larger update to thread_pool.h on the way.

No worries

I'm not sure I understand the difference between array and array. Pointer chasing or Gather are slow and best avoided when possible.

I prefer not to force 8 streams as the function next or unifom return a Vec which can be dirtecly used inside another highway function, should I force 8 streams and return an array of 8 elements?

If you want to return a vector, then I'd strongly suggest you adopt a class interface that does not store the internal state at all, but rather exposes them as two by-ref arguments that the caller passes in. This avoids any waste, and then callers can iterate until their desired number of streams has been satisfied by the number of vectors processed so far.

Solved this issue with AlignedNDArray, this also allows to maintain proper encapsulation, slight downside redundant loads and stores in certain circumstances.

The Xorshift128Plus in hwy/contrib/sort/algo-inl.h is an example of this, but it only supports a single vector. If you want multiple streams, then the caller would Load[U] those vectors from an array of uint64_t. BTW feel free to move Xorshift128Plus to your new file if you like.

Different algorithm, I do not see a point in moving it here.

A function to generate N (where N is a multiple of the Lanes) random numbers and return them in an alinged array.

This sounds useful :)

Implemented.

a BufferedXoshiro that generates N numbers at once and returns them one at the time. This can also be compatible with std::random but potentially faster as generates the numbers in one go and stores them until all consumed. I did not make this the default behavior as some platforms might have a limit on the memory available

This also sounds useful. As is often the case, portability requires dynamic allocation of N*Lanes(d) elements, you can choose how many vectors N to buffer.

Implemented!

FYI I am currently preparing for an conference and traveling tomorrow, hence have a bit higher response time.

No worries, I am also busy with wrapping up my PhD...

DiamonDinoia avatar Feb 07 '24 17:02 DiamonDinoia

Nice, thanks for making the updates! I like your NDArray idea.

CI has noticed an issue, looks like #include is required in aligned_allocator.

I could do it but I think is not good practice. Let me know if you want me to.

I trust you to, but have resolved the current ones. Got one more comment, a simplification/update to the Vec<> usage.

No worries, I am also busy with wrapping up my PhD...

Nice, best wishes for that! It is a good feeling to hand it in after a long time.

jan-wassenberg avatar Feb 09 '24 14:02 jan-wassenberg

Dear @jan-wassenberg

I renamed using T, V to something more appropriate as suggested.

CI has noticed an issue, looks like #include is required in aligned_allocator.

Thanks for pointing it out, addressed.

Nice, best wishes for that! It is a good feeling to hand it in after a long time.

Indeed it is! It was a long journey and I am looking forward to the next adventure!

DiamonDinoia avatar Feb 09 '24 17:02 DiamonDinoia

If this test pipeline passes, I need to integrate the tests in bazel ( I never used it) and document the code a bit, I need to check how rst works so that the comments appear in the official documentation.

DiamonDinoia avatar Feb 09 '24 18:02 DiamonDinoia

The tests have uncovered some issues:

AlignedVectorstd::size_t result(n);

Only [u]intXX_t are supported types for lanes. Technically size_t could be a different type, and certainly differs between platforms and is causing mismatches between u32 (which it is on Arm7) and the 64-bit we expect. How about we switch it to uint64_t?

For printf, there are also some errors. Unfortunately %lu is nonportable, what is required is the rather ugly PRIu64 outside of the string literal.

If you would like to run these tests locally, for far quicker turnaround times than our CI, here's how to run Arm7 (from run_tests.sh):

First install qemu-user-static g++-12-arm-linux-gnueabihf, then

export QEMU_LD_PREFIX=/usr/arm-linux-gnueabihf
rm -rf build_arm7 && mkdir build_arm7 && cd build_arm7
CC=arm-linux-gnueabihf-gcc-11 CXX=arm-linux-gnueabihf-g++-11 cmake .. -DHWY_CMAKE_ARM7:BOOL=ON -DHWY_WARNINGS_ARE_ERRORS:BOOL=ON -DCMAKE_BUILD_TYPE=Release
make -j && ctest -j && cd .. && rm -rf build_arm7

Adding the test to Bazel can be done by inserting a line into HWY_TESTS in the BUILD file: ("hwy/contrib/random/", "random_test"),

jan-wassenberg avatar Feb 12 '24 06:02 jan-wassenberg

HI @jan-wassenberg,

I did as you mentioned. I also tried to integrate random in bazel, from command line crypto test does not build but I am not sure if there is a problem with my changes as I have no experience.

There is something weird happening that might be fixed with a bunch of reinterpret_cast<> but I would like you to have a look because I am not sure that this is correct.

When I build for arm I see the following error:

/highway/hwy/contrib/random/random.h:205:41: error: cannot convert ‘long long unsigned int*’ to ‘hwy::N_NEON_WITHOUT_AES::TFromD<hwy::N_NEON_WITHOUT_AES::Simd<unsigned int, 4, 0> >*’ {aka ‘const unsigned int*’}
  205 |     auto s0 = Load(tag, state_[{0}].data());
      |                         ~~~~~~~~~~~~~~~~^~
      |                                         |
      |                                         long long unsigned int*

Is this intended? Is it going to work? Is packing 2x64 bits integers in 4x32. The size is okay but will the arithmetic work? For example, will the carry from the lower 32 bits be propagated to the higher 32?

Thanks, Marco

DiamonDinoia avatar Feb 13 '24 16:02 DiamonDinoia

I also tried to integrate random in bazel, from command line crypto test does not build

huh, the Bazel change looked like it should work, though we are overwriting the previous unroller library and should undo that. What's the build error you see?

When I build for arm I see the following error:

This is because we are mixing size_t and uint64_t. size_t is not a valid lane type; we should always use the fixed-size uintXX_t types. I've added comments at the three spots where I saw size_t lane types.

Is packing 2x64 bits integers in 4x32.

Note that this is generally possible if you really want to, via BitCast. But in this case I think we want everything to be 64-bit and we are only stumbling over the fact that size_t is not also 64-bit on Armv7.

jan-wassenberg avatar Feb 14 '24 11:02 jan-wassenberg

Dear @jan-wassenberg,

huh, the Bazel change looked like it should work, though we are overwriting the previous unroller library and should undo that. What's the build error you see?

$ bazel build //:all
ERROR: /data/highway/BUILD:503:16: in deps attribute of cc_test rule //:mask_test: target '//:unroller' does not exist. Since this rule was created by the macro 'cc_test', the error might have been caused by the macro implementation
ERROR: /data/highway/BUILD:503:16: Analysis of target '//:mask_test' failed
ERROR: Analysis of target '//:mask_test' failed; build aborted
INFO: Elapsed time: 0.135s, Critical Path: 0.00s
INFO: 0 processes.
ERROR: Build did NOT complete successfully

This is because we are mixing size_t and uint64_t. size_t is not a valid lane type; we should always use the fixed-size uintXX_t types. I've added comments at the three spots where I saw size_t lane types.

Note that this is generally possible if you really want to, via BitCast. But in this case I think we want everything to be 64-bit and we are only stumbling over the fact that size_t is not also 64-bit on Armv7.

You are correct. I thought I replaces all the size_t to uint64_t but I missed some. I apologize for wasting your time. This seems to work now. But, I get another error. I investigated a bit but I am not sure what is wrong. Could you have a look at this?

In file included from /homes/mb5119/data/highway/hwy/highway.h:432,
                 from /homes/mb5119/data/highway/hwy/contrib/random/random_test.cc:15:
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h: In instantiation of ‘class hwy::N_NEON_WITHOUT_AES::Vec128<double, 2>’:
/homes/mb5119/data/highway/hwy/contrib/random/random.h:226:38:   required from here
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h:803:9: error: invalid use of incomplete type ‘struct hwy::N_NEON_WITHOUT_AES::detail::Raw128<double, 2>’
  803 |   using Raw = typename detail::Raw128<T, N>::type;
      |         ^~~
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h:602:8: note: declaration of ‘struct hwy::N_NEON_WITHOUT_AES::detail::Raw128<double, 2>’
  602 | struct Raw128;
      |        ^~~~~~
In file included from /homes/mb5119/data/highway/hwy/contrib/random/random_test.cc:16:
/homes/mb5119/data/highway/hwy/contrib/random/random.h: In member function ‘hwy::N_NEON_WITHOUT_AES::VectorXoshiro::VF64 hwy::N_NEON_WITHOUT_AES::VectorXoshiro::Uniform()’:
/homes/mb5119/data/highway/hwy/contrib/random/random.h:230:32: error: no matching function for call to ‘ConvertTo(hwy::N_NEON_WITHOUT_AES::ScalableTag<double>&, const hwy::N_NEON_WITHOUT_AES::Vec128<long long unsigned int, 2>&)’
  230 |     const auto real = ConvertTo(real_tag, bits);
      |                       ~~~~~~~~~^~~~~~~~~~~~~~~~
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h:3862:23: note: candidate: ‘template<class D, hwy::EnableIf<IsSame<typename hwy::RemoveConstT<typename hwy::RemoveVolatileT<typename hwy::RemoveRefT<typename D::T>::type>::type>::type, float>()>* <anonymous> > hwy::N_NEON_WITHOUT_AES::Vec128<float, 4> hwy::N_NEON_WITHOUT_AES::ConvertTo(D, Vec128<int, 4>)’
 3862 | HWY_API Vec128<float> ConvertTo(D /* tag */, Vec128<int32_t> v) {
      |                       ^~~~~~~~~
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h:3862:23: note:   template argument deduction/substitution failed:
In file included from /homes/mb5119/data/highway/hwy/highway.h:19,
                 from /homes/mb5119/data/highway/hwy/contrib/random/random_test.cc:15,
                 from /homes/mb5119/data/highway/hwy/foreach_target.h:327,
                 from /homes/mb5119/data/highway/hwy/contrib/random/random_test.cc:14:
/homes/mb5119/data/highway/hwy/base.h: In substitution of ‘template<bool Condition> using EnableIf = typename hwy::EnableIfT::type [with bool Condition = false]’:
/homes/mb5119/data/highway/hwy/ops/arm_neon-inl.h:3861:20:   required from here
/homes/mb5119/data/highway/hwy/base.h:472:7: error: no type named ‘type’ in ‘struct hwy::EnableIfT<false>’
  472 | using EnableIf = typename EnableIfT<Condition>::type;

Thanks, Marco

DiamonDinoia avatar Feb 21 '24 16:02 DiamonDinoia

Nice, u64 change looks good. No worries :)

Don't worry about the Bazel issue with :unroller, this is unrelated and we'll take care of it.

The remaining issue is that Armv7 does not support double, so we have to gate any usages of it behind #if HWY_HAVE_FLOAT64. Sorry I didn't realize this during review.

jan-wassenberg avatar Feb 22 '24 03:02 jan-wassenberg

Dear @jan-wassenberg,

It seems good now.

Thanks, Marco

DiamonDinoia avatar Feb 22 '24 11:02 DiamonDinoia

The 'unroller' issue/comment above is still causing internal CI to fail :)

jan-wassenberg avatar Feb 24 '24 04:02 jan-wassenberg

Thanks for updating :) Yikes, a long series of comments because we are seeing issues one by one as the CI raises them. #include <hwy/*> must actually be #include "hwy/*", sorry I did not catch this during review :/

jan-wassenberg avatar Feb 27 '24 04:02 jan-wassenberg

Thanks for updating :) Yikes, a long series of comments because we are seeing issues one by one as the CI raises them. #include <hwy/*> must actually be #include "hwy/*", sorry I did not catch this during review :/

Fixed now. Let's see if CI is happy :)

DiamonDinoia avatar Feb 27 '24 17:02 DiamonDinoia

Oh, now it looks like libstdc++ is shooting us down :)

uniform_real_distribution.h:102:17: error: static assertion failed due to requirement '__libcpp_random_is_valid_urng<hwy::N_EMU128::CachedXoshiro<1024>, void>::value': 
  102 |   static_assert(__libcpp_random_is_valid_urng<_URNG>::value, "");

Probably we're missing one of the required typedefs/member functions. Would you like to compare with absl? We can ignore all of the PascalCase members.

jan-wassenberg avatar Feb 28 '24 03:02 jan-wassenberg

Oh, now it looks like libstdc++ is shooting us down :)

uniform_real_distribution.h:102:17: error: static assertion failed due to requirement '__libcpp_random_is_valid_urng<hwy::N_EMU128::CachedXoshiro<1024>, void>::value': 
  102 |   static_assert(__libcpp_random_is_valid_urng<_URNG>::value, "");

Probably we're missing one of the required typedefs/member functions. Would you like to compare with absl? We can ignore all of the PascalCase members.

It seems that result_type is needed. I completely forgot about it as it is not a requirement from c++23 onwards.

DiamonDinoia avatar Feb 28 '24 17:02 DiamonDinoia

Interesting, I did not know it is no longer required. Today's harvest: 😿

  1. wasm is throwing an exception, perhaps in an infinite loop hence stack overrun:
[ RUN      ] HwyRandomTestGroup/HwyRandomTest.TestNextFixedNRandomUint64/EMU128
Aborted(stack overflow (Attempt to set SP to 0xfffcff10, with stack limits [0x00000000 - 0x00010000]). If you require more stack space build with -sSTACK_SIZE=<bytes>)
/wasm_node_wrapper/patched_random_test.js:226
      throw ex;
MSVC\14.28.29910\include\vector(682): error C2440: 'static_cast': cannot convert from 'hwy::AlignedAllocator<double>' to 'hwy::AlignedAllocator<_Newfirst>'
        with
        [
            _Newfirst=std::_Container_proxy
        ]
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(682): note: No constructor could take the source type, or constructor overload resolution was ambiguous
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(679): note: while compiling class template member function 'std::vector<double,hwy::AlignedAllocator<double>>::~vector(void) noexcept'
z:\build\work\1ba0c368315dfa93895661bd44ff5a541c28\google3\third_party/highway/hwy/contrib/random/random-inl.h(257): note: see reference to function template instantiation 'std::vector<double,hwy::AlignedAllocator<double>>::~vector(void) noexcept' being compiled
z:\build\work\1ba0c368315dfa93895661bd44ff5a541c28\google3\third_party/highway/hwy/contrib/random/random-inl.h(256): note: see reference to class template instantiation 'std::vector<double,hwy::AlignedAllocator<double>>' being compiled
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(682): error C2530: '_Alproxy': references must be initialized
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(683): error C3536: '_Alproxy': cannot be used before it is initialized
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(683): error C2672: '_Delete_plain_internal': no matching overloaded function found
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(683): error C2893: Failed to specialize function template 'void std::_Delete_plain_internal(_Alloc &,_Alloc::value_type *const ) noexcept'
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\xmemory(1025): note: see declaration of 'std::_Delete_plain_internal'
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(683): note: With the following template arguments:
e:\google3\third_party\unsupported_toolchains\msvc\vc_16\files\vc\Tools\MSVC\14.28.29910\include\vector(683): note: '_Alloc=int'
ERROR: third_party/highway/BUILD:534:16: Compiling third_party/highway/hwy/contrib/random/random_test.cc failed

That line is:

    ~vector() noexcept {
        _Tidy();
#if _ITERATOR_DEBUG_LEVEL != 0
        auto&& _Alproxy = _GET_PROXY_ALLOCATOR(_Alty, _Getal());   // <<--

I think this is usually fixed by providing a rebind<> member of allocators. Are there any other typedefs etc required for allocators?

jan-wassenberg avatar Feb 29 '24 12:02 jan-wassenberg

Hi,

  1. The issue is that I use a std::array on that test. std::array allocates on the stack. I reduced the size of the array by reducing the number of tests.
  2. According to this: https://cplusplus.com/reference/memory/allocator_traits/ what is missing are the == and != operators. std::allocator_traits should take care of the remaining boilerplate.

I am not too sure about 2, I can investigate a bit but it might take some time for me.

Thanks, Marco

DiamonDinoia avatar Feb 29 '24 17:02 DiamonDinoia

Nice, thanks for updating :) The current issue we are seeing is the extraneous ; after {} on line 136, which actually fails the build due to our -Werror.

jan-wassenberg avatar Mar 04 '24 08:03 jan-wassenberg

Nice, thanks for updating :) The current issue we are seeing is the extraneous ; after {} on line 136, which actually fails the build due to our -Werror.

Fixed.

DiamonDinoia avatar Mar 04 '24 19:03 DiamonDinoia

Thanks for your patience and persisting until the end! :tada:

DiamonDinoia avatar Mar 05 '24 10:03 DiamonDinoia