StringZilla icon indicating copy to clipboard operation
StringZilla copied to clipboard

Added windows cross compile builds & fixed build issues

Open ashbob999 opened this issue 1 year ago • 20 comments

  • Added cross compiling builds (x64, x86, ARM64) for windows (MSVC), and fixed build issues.
  • Added support for using MSVC ARM intrinsics.
  • Added tests for sz_u32_clz which will help catch if the tests are run without BMI support.
  • Fixed build issues when building MacOS Universal2 binaries.
  • Disabled AVX for 32-bit binaries.
  • Fixed lots of issues when building a Windows 32-bit binary.
  • Updated release.yml to build x86/x64/arm64 windows versions, and included the .lib file in the archive.
  • Added checks for making sure stringzillite is built without any dependencies.
  • Fixed #185

ashbob999 avatar Sep 15 '24 16:09 ashbob999

@ashvardanian For the x86 (32-bit) builds, there are warnings about conversion loss of data, due to converison between sz_size_t and sz_u64_t.

sz_sorted_idx_t is defined as a sz_u64_t.

Functions where this happens: sz_partition, sz_merge, sz_sort_introsort_recursion, sz_sort_partial.

Plus there are other places where the sz_u64_t are used directly, but only to be converted to sz_size_t.

https://github.com/ashvardanian/StringZilla/blob/a34d8365de12a0820f025eb7aedef73a8f911292/include/stringzilla/stringzilla.h#L3478-L3487

Other functions it occurs in: _sz_sift_down, sz_sort_introsort_recursion, sz_string_unpack.

So should all of these be changed to use sz_size_t?

Also should we be running the NumPy tests when building the python binaries?

ashbob999 avatar Sep 21 '24 18:09 ashbob999

Hey @ashbob999! Those places require the unsigned integer to be 64 bit, so it shouldn't be changed.

ashvardanian avatar Sep 21 '24 18:09 ashvardanian

Hey @ashbob999! Those places require the unsigned integer to be 64 bit, so it shouldn't be changed.

Although, if I have understood the sz_sort_insertion function correctly and the sz_sequence_t struct.

That the order member is a pointer to an array of indexes, which specify the sorted order, and the count is the number of indexes in the array. Then don't we already have a mismatch, because we have a uint32 for the count and a uint64 for the index of each string?

And because the order pointer array is populated with the data ptr from a std::vector, which the vector max size is 32-bit.

ashbob999 avatar Sep 21 '24 18:09 ashbob999

@ashvardanian what should I do about the warnings for the 32-builds (for the sorting functions)?

ashbob999 avatar Oct 11 '24 15:10 ashbob999

@ashbob999, what's the cleanest way to reproduce some of those warnings/errors with GCC/Clang? It would be easier to choose a path forward if I can better understand their nature.

ashvardanian avatar Oct 11 '24 15:10 ashvardanian

@ashvardanian

Just building it as 32-bit (m32 and adding -Wconversion although this adds other warnings for clang) should be enough. Although I cannot seem to reproduce some of the warnings that MSVC emits.

For example the warning about converting the result of sz_blend_u64 doesn't get emitted even though it is truncating from u64 to u32.

https://github.com/ashvardanian/StringZilla/blob/dd2b94940b06e8bb967b1b53177ce8505757e2f0/include/stringzilla/stringzilla.h#L3162

But only a very few actually get emitted (GCC).

[1/2] Building CXX object CMakeFiles/stringzilla_test_cpp14.dir/scripts/test.cpp.o
In file included from /usr/include/c++/11/cassert:44,
                 from ../include/stringzilla/stringzilla.hpp:57,
                 from ../scripts/test.cpp:21:
../scripts/test.cpp: In function ‘void test_sequence_algorithms()’:
../scripts/test.cpp:1347:90: warning: conversion from ‘__gnu_cxx::__alloc_traits<std::allocator<long long unsigned int>, long long unsigned int>::value_type’ {aka ‘long long unsigned int’} to ‘std::vector<std::__cxx11::basic_string<char> >::size_type’ {aka ‘unsigned int’} may change value [-Wconversion]
 1347 |             for (std::size_t i = 1; i != dataset_size; ++i) { assert(dataset[order[i - 1]] <= dataset[order[i]]); }
      |                                                                                          ^
../scripts/test.cpp:1347:111: warning: conversion from ‘__gnu_cxx::__alloc_traits<std::allocator<long long unsigned int>, long long unsigned int>::value_type’ {aka ‘long long unsigned int’} to ‘std::vector<std::__cxx11::basic_string<char> >::size_type’ {aka ‘unsigned int’} may change value [-Wconversion]
 1347 |             for (std::size_t i = 1; i != dataset_size; ++i) { assert(dataset[order[i - 1]] <= dataset[order[i]]); }
      |                                                                                                               ^
[2/2] Linking CXX executable stringzilla_test_cpp14

But for whatever reason (maybe my building of the 32-bit with GCC/Clang is broken) but if I build it as 64-bit but with sz_size_t and sz_ssize_t changed to be uint32_t and int32_t respectively, then I get the warnings that MSVC emitted.

ashbob999 avatar Oct 11 '24 21:10 ashbob999

This PR looks great, thank you @ashbob999! Give me a bit of time to test it 🤗

ashvardanian avatar Oct 12 '24 21:10 ashvardanian

@ashvardanian apart from the issue with the warnings, I have discovered that because we are doing 64-bit integer maths, that some of these calculations have to be emulated. But when we build stringzillite we remove the MSVC crt library which provides us with these emulations we then get linker errors (undefined reference to __aullshr/__aullrem/__allshl/__allmul).

Getting the 32-bit fully building is becoming a bit painful, but if it is still needed there are some ways to fix it:

  • In certain places like sz_u64_ctz we could manually split the 64-bit value into 2 32-bit values (although this issue might go away if the next option was implemented).
  • For some other places 64-bit SWAR is being used, so this may have to be changed to use 32-bit SWAR.
  • Or we could use MMX instructions, but that may be worse, but converting to/from might not be pretty.

ashbob999 avatar Oct 12 '24 22:10 ashbob999

I think the first option, emulating ourselves, may be the best. In the next major release, StringZilla is getting a few new algos, and I want it to be a complete superset/extension of STL, so it would be great if we can avoid all dependencies 🤗

ashvardanian avatar Oct 12 '24 22:10 ashvardanian

Although would emulating the 64-bit operations, in places where 64-bit SWAR is used, not be slower than just using 32-bit SWAR?

ashbob999 avatar Oct 12 '24 22:10 ashbob999

That's a valid concern, @ashbob999. Still the gains from 4-byte SWAR would be laughable, I believe. So we should probably just stick to serial code in such places. Wyt?

ashvardanian avatar Oct 13 '24 01:10 ashvardanian

@ashvardanian these are my results after some quick & rough benchmarking of the few some different sz_fill functions:

memset 18.6266 GB/s 3602851.1 ns 6940 iterations
sz_fill_serial 3.5075 GB/s 19132826.8 ns 1308 iterations current version
sz_fill_serial_swar32 8.1238 GB/s 8260806.9 ns 3028 iterations uses 32-bit SWAR (with msvc using memset for some of the loops)
sz_fill_serial_no_deps 7.9455 GB/s 8446142.8 ns 2960 iterations uses 32-bit SWAR and switch to handle the remaining bytes
sz_fill_avx2 9.9660 GB/s GB/s 6733786.2 ns 3716 iterations I thought this would be much faster that the serial/SWAR
sz_fill_serial_single_instruction 16.8512 GB/s GB/s 3982446.4 ns 6280 iterations Just using a single call to __stosb which becomes rep stosb (just for comparison only, but internally does it's own form of dynamic dispatch for what the cpu supports)

These were done on an i5 6500t, and i got the same results for GCC. Using the entire leipzig1m dataset. When running the 32-bit version, although the 64-bit results are almost identical.

I'm not entirely sure why my tests of serial_no_deps, swar32 and avx2 are much slower than memset. Because in the test results you put on the latest release they were all very similar. Maybe because if you built them with AVX512 enabled the compiler has emitted AVX512 instructions into the serial/avx2 functions (maybe we need to change the build steps for the benchmarks to avoid this).

So I would be happy to go with the sz_fill_serial_no_deps (SWAR 32 with switch), for MSVC at least, and we can keep the while loops for GCC/Clang.

ashbob999 avatar Oct 13 '24 16:10 ashbob999

@ashbob999, great! For basic operations like sz_fill we can add and #if condition with a 32-bit variant inside. I'd avoid creating separate function definitions. sz_copy and sz_move may benefit as well. Any other APIs on your mind?

ashvardanian avatar Oct 13 '24 17:10 ashvardanian

@ashvardanian okay, that's what I was going to do. As for other APIs we'll see once I fix sz_fill and other similar ones.

As for the 32-bit warnings, should we adjust them, and say there is a 4 billion limit for sorting?

ashbob999 avatar Oct 13 '24 18:10 ashbob999

I believe it's already documented, that sorting over 4B entries is not currently supported. If not, we should mark that.

ashvardanian avatar Oct 13 '24 18:10 ashvardanian

@ashvardanian that's fine, so it should then be fine to change the size of the values in the sorting functions to sz_size_t, because do correct me if i'm wrong but i don't remember seeing any logic that explicitly required 64-bit integer (as they are just used for indexes).

ashbob999 avatar Oct 13 '24 19:10 ashbob999

@ashvardanian, Somehow I missed that for sz_sort_partial we are adding the first 4 bytes from each string into the upper half of the sz_u64_t. Which would make sense why it's limited to 4 billion (although I couldn't find anything that said it is limited) so it needs to be a sz_u64_t.

Would it make sense to change the type of count in sz_sequence_t from a sz_size_t to sz_u32_t as on 64-bit platforms giving more would cause it to behave incorrectly. Although this would break ABI with previous versions.

Can you think of any better way of forcing the 4B limit on the user?

ashbob999 avatar Oct 20 '24 14:10 ashbob999

Can you think of any better way of forcing the 4B limit on the user?

I would stick to size_t for sizes and focus on designing a better sorting algorithm. Long-term sustainable solutions are better than short-term gains 😉

ashvardanian avatar Oct 22 '24 13:10 ashvardanian

Hey @ashbob999! I'll have more time to look into this next week.

I want to thank you for following the git message style - it's very pleasing to see! Unlike most PRs, where I end up squashing the patches and merging under a new name, I would love to preserve your entire commit history. I'd just ask you to reformat/squash the last three commits that deviate from the style. Thanks again!

ashvardanian avatar Oct 30 '24 10:10 ashvardanian

@ashvardanian no problem, I purposely did not make them that way jsut because they were fixing issues that I noticed from previous commits in the merge.

I can squash them into a single one, labelled as minor fixes (or equivalent), do you also want me to rebate to fix one in the middle a35cc50903d2a3e41b3ecac0559aadc950825ff1?

ashbob999 avatar Oct 30 '24 10:10 ashbob999

⚠️ Only 5 files will be analyzed due to processing limits.

recurseml[bot] avatar May 02 '25 15:05 recurseml[bot]

⚠️ Only 5 files will be analyzed due to processing limits.

recurseml[bot] avatar May 02 '25 15:05 recurseml[bot]

😱 Found 4 issues. Time to roll up your sleeves! 😱

recurseml[bot] avatar May 02 '25 15:05 recurseml[bot]

Hi @ashbob999! I've tried to incorporate as many of your suggestions as possible into the v4 release a couple of weeks ago, but I'm not sure how many I've managed to merge. I'm closing this PR as it's too divergent from the current state, but I wanted to thank you again for all your efforts! If there are minor changes we can incorporate into the new build, please let me know 🤗

ashvardanian avatar Oct 02 '25 18:10 ashvardanian