corrade icon indicating copy to clipboard operation
corrade copied to clipboard

String API additions and SIMD optimizations

Open mosra opened this issue 3 years ago • 0 comments

A meta-issue tracking various ideas for SIMD-optimized string algorithms. A reason why we're making our own string APIs is because the C string library is made for null-terminated strings, which is quite useless when the major use case is working on slices of larger strings, such as in parsers. And (of course) the C++ counterparts are too bloated and either impose an implicit allocation or require a too new C++ standard.

SIMD in general

  • [x] Needs #115 to be merged
    • [x] ~~Figure out how to dispatch~~ handled in #115 as well

Construction

  • [ ] Add a way to automatically mark a view as Global if it's a C string literal (using __constant_string_p() compiler builtin, details in #123)
  • [ ] Add a way to automatically mark an arbitrary const char* as Global by checking if the pointer is in a read-only memory page -- https://github.com/RobloxResearch/SIMDString/blob/main/SIMDString.cpp#L54-L75

Searching

  • [ ] SIMD-optimized memchr(), memchr2(), memchr3() etc., used in find(char), findAnyOf(), split() etc. for quickly skipping until end of line or end of a JSON string -- https://docs.rs/memchr/2.3.4/src/memchr/x86/avx.rs.html
    • [ ] There's also a SSE 4.2 implementation here: https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (the null-terminated aspect is optional in the instructions, I hope?)
      • [ ] "But yes, any strcmp/strlen style function that uses SSE/AVX/etc. without a buffer size parameter, must first read in smaller units until it gets to a 16/32 byte aligned address, after which it can start using SSE/AVX safely without having to worry about faulting even if it reads past the end of the string." means I wouldn't need to have the tail part, unlike the Rust implementation? needs significant amount of testing tho, and probably could still trigger ASan with growable arrays, ARM MTE etc.
    • [ ] Possibly an inverse (findAnyNotOf()) when there's a lot of candidates and memchr13() would be impossible to vectorize ("find the first that's not a space or newline and then decide what to do next")
    • [ ] Movemask isn't on ARM, but https://twitter.com/Danlark1/status/1539344315356520450 should be a better option than emulating it
    • [ ] The scalar fallback (and less-than-a-vector fallback?) could be optimized to use Duff's Device or SWAR: https://lore.kernel.org/lkml/[email protected]/T/#u
    • [ ] Anything useful in https://github.com/ashvardanian/Stringzilla?
    • [ ] There's also an AVX512 variant in https://github.com/Kraionix/ImGuiMemchrBench/blob/master/ImMemchrBench/immemchr.h
  • [ ] memrchr() and variants of above for findLastAnyOf(). Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
  • [ ] memmem() for faster general find() because of course strstr() is useless and I think we could do better than memcmp() in a loop. Linux has it, not sure where to look for a non-tainted implementation. At least we can benchmark against it as a black box.
    • [ ] and memrmem() for findLast()
    • [ ] if ever needed, a STL-less alternative to std::boyer_moore_searcher (but this one needs a lot of preprocessing state)
      • https://stackoverflow.com/questions/3183582/what-is-the-fastest-substring-search-algorithm
      • Exact Packed String Matching (2013) is the best?
      • https://smart-tool.github.io/smart/
      • check https://github.com/davidesantangelo/krep for anything useful (it picks SSE4.2 for certain search sizes, AVX2 is also advertised but not actually present)
  • [ ] Levenshtein distance ("did you mean ...") implementation finding the closest string from a list when Arguments, plugin name, ... is mistyped --https://github.com/ninja-build/ninja/blob/ca041d88f4d610332aa48c801342edfafb622ccb/src/edit_distance.cc#L20-L69

Comparison

  • [ ] Check pointer equality before calling into memcmp() in StringView::operator==(), could save a lot especially when comparing literals that the compiler might have deduplicated
    • Don't do that in String tho
  • [ ] Would we gain anything by implementing memcmp() ourselves?
    • especially for SSO strings that have a fixed size, which could be a single (masked) instruction?
    • by not having to explicitly test for nullptr when size == 0 just to not hit an UB because the standard is stupid and generally disallows passing nullptr to any string/memory function even if the size is zero?
  • [ ] Case-insensitive comparison -- http://www.phoronix.com/scan.php?page=news_item&px=Glibc-strcasecmp-AVX2-EVEX
    • Possibly useful for extension comparison in Any* plugins, OTOH there it's probably faster to normalize the extension first and then do 100 memcmp()s

Unicode

  • [ ] UTF-16 surrogates to UTF-8 and back, for JSON \u parsing
  • [ ] SIMD-ified UTF-8 codepoint counting and validation -- https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
  • [ ] UTF-8 character counting including counting decomposed characters (such as a and ¨ as one), needs some Unicode categorization table
    • does https://github.com/ridiculousfish/widecharwidth work like that or not?

Number-to-string

For Utility::Debug, Utility::format() etc. The core should be a direct overhead-less API working on builtin types (writing into a statically-sized char[], e.g.), with convenience wrappers above.

  • [ ] integer-to-string conversion instead of stupidly using printf() -- even a dumb while loop would be better
    • https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/
    • https://github.com/jeaiii/itoa (and explanation from https://jk-jeon.github.io/posts/2022/02/jeaiii-algorithm/), doesn't seem to use any explicit SIMD?
  • [ ] separate optimized hex, bin and oct variant
    • https://lemire.me/blog/2021/05/17/converting-binary-integers-to-ascii-characters-apple-m1-vs-amd-zen2/
    • http://0x80.pl/articles/convert-to-hex.html
  • [ ] float-to-string conversion instead of abusing printf()
    • [ ] Ryu is outdated, latest and greatest is https://github.com/jk-jeon/dragonbox but it's C++17, quite large with a lot of options we don't need, do a clean-room implementation from the paper alone?
      • MSVC to_chars used Ryu, not sure if it's still: https://twitter.com/StephanTLavavej/status/992881364142702593
      • Older benchmarks here: https://github.com/miloyip/dtoa-benchmark
    • [ ] Explicit option to write or fail on NaN and Inf (for JSON compat)
    • [ ] Uppercase or lowercase exponent
    • [ ] Option to write hex floats (or a separate function that doesn't need the complex algo?)
  • [ ] Any easy gains with a vectorized variant? "Convert 8 numbers to 8 strings at once"
    • Useful for JSON, OBJ, ... export

String-to-number

Because strto*() has insane usability issues.

  • [ ] string-to-integer conversion instead of stupidly using strtol() -- even a dumb while loop could be better
    • have 8-, 16-, 32- bit variants that check the range on their own
    • remove the stupid handling of - for unsigned types (who thought that was a good idea??)
    • https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
    • https://kholdstare.github.io/technical/2020/05/26/faster-integer-parsing.html and the lengthy post linked from there
  • [ ] separate optimized hex, bin and oct variant
    • http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html
  • [ ] string-to-float conversion instead of using the shitty & slow strtof() / strtod()
    • [ ] What's the algorithm here? Some pointers could be in comments at https://lemire.me/blog/2020/09/10/parsing-floats-in-c-benchmarking-strtod-vs-from_chars/
    • [ ] Explicit option to parse or fail on NaN and Inf (for JSON compat)
    • [ ] Explicit option to write or ignore hex floats (or a separate function that doesn't need the complex algo?)
    • [ ] Also watch out for accumulated rounding errors when going to float through a double: https://www.exploringbinary.com/double-rounding-errors-in-decimal-to-double-to-float-conversions/
  • [ ] Any easy gains with a vectorized variant? "Convert 8 strings to 8 numbers at once"
    • Useful for JSON, OBJ, ... parsing

General printing

  • Is some printf compatibility desired? like https://github.com/tfc/pprintpp
  • Optimization ideas from Rust: https://internals.rust-lang.org/t/changing-core-fmt-for-speed/5154

mosra avatar Mar 31 '22 09:03 mosra