Marek Gagolewski comments

Results: 118 comments by Marek Gagolewski

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

I wonder if `strength=2` is what you might need:

```r
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", strength = 2L, case_level = TRUE, locale = "pt_BR")
## [1] TRUE TRUE FALSE FALSE
stringi::stri_detect_coll(c("Mario", "mario", "Mário",...
```

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

First of all, thanks, there was a bug; `locale=""` should mean `locale=NULL`, i.e., your own locale, `pt_BR`.

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

Instead, `locale=''` meant `locale="POSIX"`, which is why it worked as expected (and perhaps this is what PostgreSQL uses, hence the correct results). I would recommend setting `locale="POSIX"` explicitly then....

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

Hmmm... interestingly, a collator-based string comparison honours the above rule:

```r
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=2L)
[1] FALSE TRUE FALSE FALSE
> stringi::stri_cmp_equiv(c("Mario", "mario", "Mário", "mário"), "mario",...
```

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

I was trying hard to figure out why `usearch` returns a different result below, but with no success. A bug in ICU?

```r
stringi::stri_detect_coll(c("Mario", "mario", "Mário", "mário"), "mario", case_level=TRUE, strength=1L)...
```

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

(note to self): ICU 69.1 gives the results as above. @TODO: create a minimal reproducible example outside of stringi

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

[note to self] Yes, this is reproducible outside of stringi:

```c++
/* g++ -std=c++11 icu_test_bug_ucol_caselevel.cpp -licui18n -licuuc -licudata && ./a.out */
#include
#include
#include
#include
#include
#include
using namespace icu;...
```

Collation: case_level doesn't seem to work properly to ignore accents but take case into account (ICU StringSearch and UCollator; UCOL_STRENGTH=UCOL_PRIMARY, UCOL_CASE_LEVEL=UCOL_ON)

All right, it turns out that this issue has already been reported. It is ICU-related. https://unicode-org.atlassian.net/browse/ICU-21338

Which functions should preserve objects' attributes?

`dim`, `names`, and `dimnames`? See `mostattributes` in `?attributes`.

Building in ALT-REP to stringi

I have actually been thinking about giving `stringi` a major rewrite for quite a long time. Now that the Windows-UCRT build of R assumes all strings are natively UTF-8, and the...