
Match failure when LC_COLLATE is not UTF-8

Open: gagolews opened this issue 8 years ago • 1 comment

For example, Windows does not set a UTF-8 locale by default.

gagolews, Apr 23 '16 13:04

Now the behavior is incorrect:

[gagolews@zeus tmp]$ LC_ALL="pl_PL.iso-8859-2" R

R Under development (unstable) (2016-04-14 r70486) -- "Unsuffered Consequences"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library("stringi")
> x <- stri_conv("a\u0105bc", "UTF-8", "")
> library(re2r) 
> re2_match("\u0105", x)
[1] FALSE
> re2_match(x, "\u0105")
B��D: invalid UTF-8 in regexp: 
> stri_extract_all_regex(x, "\u0105")    # this is OK
[[1]]
[1] "�"

Consider converting all input strings to UTF-8, preferably with `stringi::stri_enc_toutf8`.
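A minimal sketch of that suggestion, assuming the conversion is applied to every argument before matching. The helper `with_utf8_args` is hypothetical and not part of re2r. It is shown here with a stringi matcher, since the argument order of `re2_match` is not assumed.

```r
library(stringi)

# Hypothetical helper, not part of re2r: convert every argument to UTF-8
# before passing it on to a matcher function chosen by the caller.
with_utf8_args <- function(matcher, ...) {
  args <- lapply(list(...), stri_enc_toutf8)
  do.call(matcher, args)
}

# Native-encoded input, built the same way as in the report above.
x <- stri_conv("a\u0105bc", "UTF-8", "")
x_utf8 <- stri_enc_toutf8(x)

# Check that the converted string is valid UTF-8 (this assumes the locale's
# encoding can represent the character in the first place).
is_valid <- stri_enc_isutf8(x_utf8)

# Illustration with a stringi matcher; a re2r function could be passed instead.
found <- with_utf8_args(stri_detect_regex, x, "\u0105")
```

Doing the conversion at the boundary keeps the matcher itself unaware of the session locale, so the same call behaves identically under UTF-8 and non-UTF-8 settings of `LC_ALL`.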

gagolews, Apr 23 '16 13:04