String API additions and SIMD optimizations

Open mosra opened this issue 3 years ago • 0 comments

A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.

SIMD in general

[x] Needs #115 to be merged
- [x] ~~Figure out how to dispatch~~ handled in #115 as well

Construction

[ ] Add a way to automatically mark a view as Global if it's a C string literal (using __constant_string_p() compiler builtin, details in #123)
[ ] Add a way to automatically mark an arbitrary const char* as Global by checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75

Searching

[ ] SIMD-optimized memchr(), memchr2(), memchr3() etc., used in find(char), findAnyOf(), split() etc. for quickly skipping until end of line or end of a JSON string -- https://docs.rs/memchr/2.3.4/src/memchr/x86/avx.rs.html
- [ ] There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
  - [ ] "But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string." means I wouldn't need to have the tail part, unlike the Rust implementation? needs significant amount of testing tho, and probably could still trigger ASan with growable arrays, ARM MTE etc.
- [ ] Possibly an inverse (findAnyNotOf()) when there's a lot of candidates and memchr13() would be impossible to vectorize ("find the first that's not a space or newline and then decide what to do next")
- [ ] Movemask isn't on ARM, but https://twitter.com/Danlark1/status/1539344315356520450 should be a better option than emulating it
- [ ] The scalar fallback (and less-than-a-vector fallback?) could be optimized to use Duff's Device or SWAR: https://lore.kernel.org/lkml/[email protected]/T/#u
- [ ] Anything useful in https://github.com/ashvardanian/Stringzilla?
- [ ] There's also an AVX512 variant in https://github.com/Kraionix/ImGuiMemchrBench/blob/master/ImMemchrBench/immemchr.h
[ ] memrchr() and variants of above for findLastAnyOf(). Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
[ ] memmem() for faster general find() because of course strstr() is useless and I think we could do better than memcmp() in a loop. Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
- [ ] and memrmem() for findLast()
- [ ] if ever needed, a STL-less alternative to std::boyer_moore_searcher (but this one needs a lot of preprocessing state)
  - https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
  - Exact Packed String Matching (2013) is the best?
  - https://smart-tool.github.io/smart/
  - check https://github.com/davidesantangelo/krep for anything useful (it picks SSE4.2 for certain search sizes, AVX2 is also advertised but not actually present)
[ ] Levenshtein distance ("did you mean ...") implementation finding the closest string from a list when Arguments, plugin name, ... is mistyped --https://github.com/ninja-build/ninja/blob/ca041d88f4d610332aa48c801342edfafb622ccb/src/edit_distance.cc#L20-L69

Comparison

[ ] Check pointer equality before calling into memcmp() in StringView::operator==(), could save a lot especially when comparing literals that the compiler might have deduplicated
- Don't do that in String tho
[ ] Would we gain anything by implementing memcmp() ourselves?
- especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
- by not having to explicitly test for nullptr when size == 0 just to not hit an UB because the standard is stupid and generally disallows passing nullptr to any string/memory function even if the size is zero?
[ ] Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
- Possibly useful for extension comparison in Any* plugins, OTOH there it's probably faster to normalize the extension first and then do 100 memcmp()s

Unicode

[ ] UTF-16 surrogates to UTF-8 and back, for JSON \u parsing
[ ] SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
[ ] UTF-8 character counting including counting decomposed characters (such as a and ¨ as one), needs some Unicode categorization table
- does https://github.com/ridiculousfish/widecharwidth work like that or not?

Number-to-string

For Utility::Debug, Utility::format() etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized char[], e.g.), with convenience wrappers above.

[ ] integer-to-string conversion instead of stupidly using printf() -- even a dumb while loop would be better
- https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
- https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
[ ] separate optimized hex, bin and oct variant
- https://lemire.me/blog/2021/05/17/converting-binary-integers-to-ascii-characters-apple-m1-vs-amd-zen2/
- http://0x80.pl/articles/convert-to-hex.html
[ ] float-to-string conversion instead of abusing printf()
- [ ] Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
  - MSVC to_chars used Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593
  - Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
- [ ] Explicit option to write or fail on NaN and Inf (for JSON compat)
- [ ] Uppercase or lowercase exponent
- [ ] Option to write hex floats (or a separate function that doesn't need the complex algo?)
[ ] Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
- Useful for JSON, OBJ, ... export

String-to-number

Because strto*() has insane usability issues.

[ ] string-to-integer conversion instead of stupidly using strtol() -- even a dumb while loop could be better
- have 8-, 16-, 32- bit variants that check the range on their own
- remove the stupid handling of - for unsigned types (who thought that was a good idea??)
- https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
- https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
[ ] separate optimized hex, bin and oct variant
- http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html
[ ] string-to-float conversion instead of using the shitty & slow strtof() / strtod()
- [ ] What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
- [ ] Explicit option to parse or fail on NaN and Inf (for JSON compat)
- [ ] Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
- [ ] Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
[ ] Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
- Useful for JSON, OBJ, ... parsing

General printing

Is some printf compatibility desired? like https://github.com/tfc/pprintpp
Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154

Mar 31 '22 09:03 mosra