
Case-insensitive Unicode manipulation

Open • ashvardanian opened this issue 2 years ago • 1 comment

Python strings offer a lot of powerful methods, such as:

  • isalnum, isalpha, isascii, isdecimal, isdigit, isspace, islower, isupper, istitle, isnumeric for checks.
  • lower and upper that copy the string.
  • casefold, which performs the case folding described in section 3.13 of the Unicode Standard.
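
For reference, a short sketch of how these built-in Python methods behave on non-ASCII input. It uses only standard `str` methods; the sample strings are picked for illustration and are not taken from StringZilla.

```python
# Character-class checks look at every code point in the string.
superscript_two = "\u00b2"               # "²"
is_digit = superscript_two.isdigit()     # counts as a digit
is_decimal = superscript_two.isdecimal() # but not as a decimal character

# lower() and upper() return converted copies; the length may change.
street = "Stra\u00dfe"                   # "Straße"
lowered = street.lower()                 # "ß" is already lowercase
uppered = street.upper()                 # "ß" expands to "SS"

# casefold() is more aggressive than lower() and is meant for
# caseless comparison: "ß" folds to "ss".
folded = street.casefold()
same_caseless = folded == "STRASSE".casefold()

print(is_digit, is_decimal, lowered, uppered, folded, same_caseless)
```

The `upper` and `casefold` results are longer than the input, which is one reason an in-place, fixed-width implementation is not enough for Unicode.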

Very few C-level libraries provide such functionality, and most of them are not known for speed. Covering a subset of it in StringZilla makes sense.

ashvardanian • Sep 23 '23 07:09

Starting with v3, part of this functionality is already available for ASCII strings. Implementing the same for UTF-8 would involve preparing huge dictionaries and potentially designing a SIMD-friendly trie or automaton, so we are not rushing those features for now.
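
To illustrate the gap between the two cases, here is a minimal Python sketch. The names `ASCII_LOWER`, `ascii_lower`, and `utf8_sizes` are hypothetical helpers for this example, not StringZilla's API. ASCII lowercasing fits in a 256-entry byte table, while Unicode case mappings can change the encoded length of a character, so a byte-for-byte table no longer works.

```python
# Assumption: a 256-entry lookup table, one output byte per input byte.
# Only the bytes for 'A'..'Z' (65..90) are remapped; all others pass through.
ASCII_LOWER = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))


def ascii_lower(data: bytes) -> bytes:
    """Lowercase ASCII letters only; multi-byte UTF-8 sequences are untouched."""
    return data.translate(ASCII_LOWER)


def utf8_sizes(ch: str, convert) -> tuple:
    """Return the UTF-8 byte length of `ch` before and after `convert`."""
    return len(ch.encode("utf-8")), len(convert(ch).encode("utf-8"))


# The ASCII table leaves the two bytes of "ß" (0xC3 0x9F) as they are.
ascii_result = ascii_lower("Stra\u00dfE".encode("utf-8"))

# Unicode mappings may grow or shrink the UTF-8 encoding:
dotted_i = utf8_sizes("\u0130", str.lower)  # "İ" lowers to "i" + U+0307
kelvin = utf8_sizes("\u212a", str.lower)    # Kelvin sign lowers to ASCII "k"
long_s = utf8_sizes("\u017f", str.upper)    # "ſ" uppercases to ASCII "S"

print(ascii_result, dotted_i, kelvin, long_s)
```

Because the output size is not known up front, a UTF-8 implementation needs a mapping from variable-length sequences to variable-length sequences, which is where the larger tables and trie-like structures come in.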

ashvardanian • Feb 27 '24 22:02