Consider adopting `simdutf` as a possible transcoding backend

Open DJm00n opened this issue 2 years ago • 1 comment

This library provides fast Unicode functions such as:

  • ASCII, UTF-8, UTF-16LE/BE and UTF-32 validation, with and without error identification,
  • Latin1 to UTF-8 transcoding,
  • Latin1 to UTF-16LE/BE transcoding,
  • Latin1 to UTF-32 transcoding,
  • UTF-8 to Latin1 transcoding, with or without validation, with and without error identification,
  • UTF-8 to UTF-16LE/BE transcoding, with or without validation, with and without error identification,
  • UTF-8 to UTF-32 transcoding, with or without validation, with and without error identification,
  • UTF-16LE/BE to Latin1 transcoding, with or without validation, with and without error identification,
  • UTF-16LE/BE to UTF-8 transcoding, with or without validation, with and without error identification,
  • UTF-32 to Latin1 transcoding, with or without validation, with and without error identification,
  • UTF-32 to UTF-8 transcoding, with or without validation, with and without error identification,
  • UTF-32 to UTF-16LE/BE transcoding, with or without validation, with and without error identification,
  • UTF-16LE/BE to UTF-32 transcoding, with or without validation, with and without error identification,
  • From a UTF-8 string, compute the size of the Latin1 equivalent string,
  • From a UTF-8 string, compute the size of the UTF-16 equivalent string,
  • From a UTF-8 string, compute the size of the UTF-32 equivalent string (equivalent to UTF-8 character counting),
  • From a UTF-16LE/BE string, compute the size of the Latin1 equivalent string,
  • From a UTF-16LE/BE string, compute the size of the UTF-8 equivalent string,
  • From a UTF-32 string, compute the size of the UTF-8 or UTF-16LE equivalent string,
  • From a UTF-16LE/BE string, compute the size of the UTF-32 equivalent string (equivalent to UTF-16 character counting),
  • UTF-8 and UTF-16LE/BE character counting,
  • UTF-16 endianness change (UTF-16LE/BE to UTF-16BE/LE).
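
To make the size-computation and endianness items above concrete, here is a minimal scalar sketch in C++. It is not simdutf's code and does not use its API: the function names are hypothetical, the input is assumed to be valid UTF-8, and the loops are plain byte-by-byte logic with no SIMD. It only illustrates why the UTF-32 size of a UTF-8 string equals its character count, and how the UTF-16 size follows from the lead bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a simdutf function). Assumes valid UTF-8.
// Number of code points, which is also the length of the UTF-32 equivalent.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes have the form 10xxxxxx; any other byte starts a code point.
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper (not a simdutf function). Assumes valid UTF-8.
// Number of 16-bit code units in the UTF-16 equivalent.
std::size_t utf16_units_from_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // one code unit per code point
        if (c >= 0xF0) ++n;           // 4-byte sequence: surrogate pair, one extra unit
    }
    return n;
}

// Hypothetical helper: swap the two bytes of one UTF-16 code unit (LE <-> BE).
char16_t swap_utf16_unit(char16_t u) {
    return static_cast<char16_t>((u << 8) | (u >> 8));
}

// Usage example: the expected values follow from the encoding rules.
void example_checks() {
    const std::string s = "a\xC3\xA9\xF0\x9F\x98\x80";  // "a", U+00E9, U+1F600
    assert(count_code_points_utf8(s) == 3);   // three code points
    assert(utf16_units_from_utf8(s) == 4);    // 1 + 1 + 2 (surrogate pair)
    assert(swap_utf16_unit(0x00E9) == 0xE900);
}
```

A caller would typically use such a length function to size the output buffer before transcoding. A SIMD implementation computes the same counts over many bytes per instruction.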

The functions are accelerated using SIMD instructions (e.g., ARM NEON, SSE, AVX, AVX-512). According to the simdutf project, when strings contain hundreds of characters, the library can often transcode them at speeds exceeding a billion characters per second.

See https://github.com/simdutf/simdutf

DJm00n, Sep 20 '23 17:09

It can already be used separately, so why adopt it here?

MBkkt, Dec 23 '23 14:12