corrade
corrade copied to clipboard
String API additions and SIMD optimizations
A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.
SIMD in general
- [x] Needs #115 to be merged
- [x] ~~Figure out how to dispatch~~ handled in #115 as well
Construction
- [ ] Add a way to automatically mark a view as
Globalif it's a C string literal (using__constant_string_p()compiler builtin, details in #123) - [ ] Add a way to automatically mark an arbitrary
const char*asGlobalby checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75
Searching
- [ ] SIMD-optimized
memchr(),memchr2(),memchr3()etc., used infind(char),findAnyOf(),split()etc. for quickly skipping until end of line or end of a JSON string -- https://docs.rs/memchr/2.3.4/src/memchr/x86/avx.rs.html- [ ] There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
- [ ] "But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string." means I wouldn't need to have the tail part, unlike the Rust implementation? needs significant amount of testing tho, and probably could still trigger ASan with growable arrays, ARM MTE etc.
- [ ] Possibly an inverse (
findAnyNotOf()) when there's a lot of candidates andmemchr13()would be impossible to vectorize ("find the first that's not a space or newline and then decide what to do next") - [ ] Movemask isn't on ARM, but https://twitter.com/Danlark1/status/1539344315356520450 should be a better option than emulating it
- [ ] The scalar fallback (and less-than-a-vector fallback?) could be optimized to use Duff's Device or SWAR: https://lore.kernel.org/lkml/[email protected]/T/#u
- [ ] Anything useful in https://github.com/ashvardanian/Stringzilla?
- [ ] There's also an AVX512 variant in https://github.com/Kraionix/ImGuiMemchrBench/blob/master/ImMemchrBench/immemchr.h
- [ ] There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
- [ ]
memrchr()and variants of above forfindLastAnyOf(). Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box. - [ ]
memmem()for faster generalfind()because of coursestrstr()is useless and I think we could do better thanmemcmp()in a loop. Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.- [ ] and
memrmem()forfindLast() - [ ] if ever needed, a STL-less alternative to
std::boyer_moore_searcher(but this one needs a lot of preprocessing state)- https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
- Exact Packed String Matching (2013) is the best?
- https://smart-tool.github.io/smart/
- check https://github.com/davidesantangelo/krep for anything useful (it picks SSE4.2 for certain search sizes, AVX2 is also advertised but not actually present)
- [ ] and
- [ ] Levenshtein distance ("did you mean ...") implementation finding the closest string from a list when
Arguments, plugin name, ... is mistyped --https://github.com/ninja-build/ninja/blob/ca041d88f4d610332aa48c801342edfafb622ccb/src/edit_distance.cc#L20-L69
Comparison
- [ ] Check pointer equality before calling into
memcmp()inStringView::operator==(), could save a lot especially when comparing literals that the compiler might have deduplicated- Don't do that in
Stringtho
- Don't do that in
- [ ] Would we gain anything by implementing
memcmp()ourselves?- especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
- by not having to explicitly test for
nullptrwhensize == 0just to not hit an UB because the standard is stupid and generally disallows passingnullptrto any string/memory function even if the size is zero?
- [ ] Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
- Possibly useful for extension comparison in
Any*plugins, OTOH there it's probably faster to normalize the extension first and then do 100memcmp()s
- Possibly useful for extension comparison in
Unicode
- [ ] UTF-16 surrogates to UTF-8 and back, for JSON
\uparsing - [ ] SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
- [ ] UTF-8 character counting including counting decomposed characters (such as
aand¨as one), needs some Unicode categorization table- does https://github.com/ridiculousfish/widecharwidth work like that or not?
Number-to-string
For Utility::Debug, Utility::format() etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized char[], e.g.), with convenience wrappers above.
- [ ] integer-to-string conversion instead of stupidly using
printf()-- even a dumbwhileloop would be better- https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
- https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
- [ ] separate optimized hex, bin and oct variant
- https://lemire.me/blog/2021/05/17/converting-binary-integers-to-ascii-characters-apple-m1-vs-amd-zen2/
- http://0x80.pl/articles/convert-to-hex.html
- [ ] float-to-string conversion instead of abusing
printf()- [ ] Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
- MSVC
to_charsused Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593 - Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
- MSVC
- [ ] Explicit option to write or fail on NaN and Inf (for JSON compat)
- [ ] Uppercase or lowercase exponent
- [ ] Option to write hex floats (or a separate function that doesn't need the complex algo?)
- [ ] Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
- [ ] Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
- Useful for JSON, OBJ, ... export
String-to-number
Because strto*() has insane usability issues.
- [ ] string-to-integer conversion instead of stupidly using
strtol()-- even a dumbwhileloop could be better- have 8-, 16-, 32- bit variants that check the range on their own
- remove the stupid handling of
-for unsigned types (who thought that was a good idea??) - https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
- https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
- [ ] separate optimized hex, bin and oct variant
- http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html
- [ ] string-to-float conversion instead of using the shitty & slow
strtof()/strtod()- [ ] What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
- [ ] Explicit option to parse or fail on NaN and Inf (for JSON compat)
- [ ] Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
- [ ] Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
- [ ] Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
- Useful for JSON, OBJ, ... parsing
General printing
- Is some printf compatibility desired? like https://github.com/tfc/pprintpp
- Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154