
Does stringi export something like `u_hasBinaryProperty(c, UCHAR_ALPHABETIC)`?

Open dmurdoch opened this issue 11 months ago • 3 comments

I am writing a parser for LaTeX code, and I'm hoping to support UTF-8 input. TeX and LaTeX assign each input character a category, and one of those categories is "letter". I'm not sure how the Unicode-supporting versions of LaTeX handle this, but one thing I wanted to try was the ICU test `u_hasBinaryProperty(c, UCHAR_ALPHABETIC)`. That's the only ICU function I need, so linking ICU into my package is possible but seems like overkill.

Does stringi provide this kind of categorization of the characters in a string? Ideally it would be something I could call from C, but if it's only available from R that would be very helpful too. I couldn't spot it in the reference docs, but maybe I just missed it.

dmurdoch avatar Feb 01 '25 17:02 dmurdoch

As per Sec. 5.4.3 of Writing R Extensions, I've made this function available via `R_GetCCallable` (in the current development version of stringi). It's declared as:

int stric_u_hasBinaryProperty(int c, int which);

See https://github.com/gagolews/stringi/blob/master/src/stri_callables.cpp

Let me know if that works for you.

gagolews avatar Feb 03 '25 13:02 gagolews

`UCHAR_ALPHABETIC` is 0 (see https://github.com/unicode-org/icu/blob/main/icu4c/source/common/unicode/uchar.h); this is very unlikely to change in the future.
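
For illustration, here is a rough sketch of how a client package might call this from C. It is not taken from stringi itself. It assumes the routine is registered under the name `stric_u_hasBinaryProperty` (matching the declaration above), that stringi is already loaded when the lookup runs, and that the package lists stringi in `Imports`. The names `HAVE_R_HEADERS`, `LATEX_UCHAR_ALPHABETIC`, `ascii_binary_property`, `get_binary_property` and `is_letter` are made up for this example. The ASCII-only fallback is only there so the file also compiles outside R.

```c
#include <stddef.h>

/* HAVE_R_HEADERS is a hypothetical macro: define it when building inside
   an R package so that the R API declarations are available. */
#ifdef HAVE_R_HEADERS
#include <R_ext/Rdynload.h>
#endif

/* Same shape as the declaration quoted in the thread:
   int stric_u_hasBinaryProperty(int c, int which); */
typedef int (*binary_property_fn)(int c, int which);

/* UCHAR_ALPHABETIC is 0 in ICU's uchar.h, as noted in the thread.
   A local name is used here to avoid pulling in the ICU headers. */
#define LATEX_UCHAR_ALPHABETIC 0

#ifndef HAVE_R_HEADERS
/* ASCII-only stand-in with the same signature, used when building
   without R and stringi. It knows nothing about non-ASCII letters. */
static int ascii_binary_property(int c, int which)
{
    if (which != LATEX_UCHAR_ALPHABETIC) return 0;
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
#endif

/* Resolve the callable once and cache the pointer. */
static binary_property_fn get_binary_property(void)
{
    static binary_property_fn fn = NULL;
    if (fn == NULL) {
#ifdef HAVE_R_HEADERS
        /* Assumption: stringi is loaded and registers the routine under
           this name. R_GetCCallable raises an R error if it is not found. */
        fn = (binary_property_fn)
            R_GetCCallable("stringi", "stric_u_hasBinaryProperty");
#else
        fn = ascii_binary_property;
#endif
    }
    return fn;
}

/* Returns 1 if the code point c has the Alphabetic property, 0 otherwise. */
int is_letter(int c)
{
    return get_binary_property()(c, LATEX_UCHAR_ALPHABETIC) != 0;
}
```

Caching the pointer keeps the lookup cost out of the per-character path, which matters for a parser that classifies every input character.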

gagolews avatar Feb 03 '25 13:02 gagolews

Thanks! I'll give it a try.

dmurdoch avatar Feb 03 '25 13:02 dmurdoch