stringi
Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding
stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:
Sys.setlocale(, "Chinese")
library(stringi)
stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # Returns TRUE, but FALSE is expected
#> [1] TRUE
grepl("县", "昌平区") # FALSE
#> [1] FALSE
Another example:
library(dplyr)
library(rvest)
library(stringi)
link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"
tx_xi <- read_html(link_speech) %>%
html_nodes("p") %>%
html_text
stri_detect_regex(tx_xi, "同志们") #Note that these are the very first three characters of the speech
#> [1] FALSE
sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Chinese (Simplified)_China.936
#> [2] LC_CTYPE=Chinese (Simplified)_China.936
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=Chinese (Simplified)_China.936
#> system code page: 65001
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods
#> [7] base
#>
#> other attached packages:
#> [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#> [1] compiler_4.2.0 tools_4.2.0 parallel_4.2.0
The issue was originally submitted to stringr (https://github.com/tidyverse/stringr/issues/386#issue-894992244), but it looks like a stringi problem?
I cannot reproduce the above; I get:
> library("stringi")
> stri_detect_regex("昌平区", "县")
[1] FALSE
> stri_detect_fixed("昌平区", "县")
[1] FALSE
> grepl("县", "昌平区")
[1] FALSE
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringi_1.7.3
loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0
>
- What does stri_escape_unicode() return on your platform when run on both strings (pattern, search string)? How about charToRaw()? How about utf8ToInt()?
- Can you try with a more recent version of the stringi package?
Also, could you please show me the result of a call to stri_info(FALSE)?
With the latter, I get:
stri_detect_regex(tx_xi, "同志们")
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[69] FALSE FALSE
> tx_xi[1]
[1] "在庆祝中国共产党成立100周年大会上的讲话"
Marek, first, thank you so much for helping me with this!!
One reason you couldn't reproduce my result may be that you didn't switch the locale to Chinese with Sys.setlocale, as I did in the first line of the example. It's important; without it, many outputs in Chinese would just be returned as hex Unicode or UTF-8 codes. (Yihui has talked about this in many places.)
Per your questions, here is what I got:
> stri_escape_unicode("昌平区")
Error in stri_escape_unicode("昌平区") :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode("县")
Error in stri_escape_unicode("县") :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
> # According to the error message, I did the following
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> ?stri_enc_toutf8
> # According to the error message, I did the following
> stri_enc_toutf8("昌平区")
[1] "昌平区"
> stri_enc_toutf8("县")
[1] "县"
>
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode(stri_enc_toutf8("县"))
Error in stri_escape_unicode(stri_enc_toutf8("县")) :
invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
>
>
> charToRaw("昌平区")
[1] b2 fd c6 bd c7 f8
> charToRaw("县")
[1] cf d8
>
> utf8ToInt("昌平区")
[1] NA
> utf8ToInt("县")
[1] NA
> stri_info(FALSE)
$Unicode.version
[1] "13.0"
$ICU.version
[1] "69.1"
$Locale
$Locale$Language
[1] "en"
$Locale$Country
[1] "US"
$Locale$Variant
[1] ""
$Locale$Name
[1] "en_US"
$Charset.internal
[1] "UTF-8" "UTF-16"
$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"
$Charset.native$Name.ICU
[1] "UTF-8"
$Charset.native$Name.UTR22
[1] NA
$Charset.native$Name.IBM
[1] "ibm-1208"
$Charset.native$Name.WINDOWS
[1] "windows-65001"
$Charset.native$Name.JAVA
[1] "UTF-8"
$Charset.native$Name.IANA
[1] "UTF-8"
$Charset.native$Name.MIME
[1] "UTF-8"
$Charset.native$ASCII.subset
[1] TRUE
$Charset.native$Unicode.1to1
[1] NA
$Charset.native$CharSize.8bit
[1] FALSE
$Charset.native$CharSize.min
[1] 1
$Charset.native$CharSize.max
[1] 3
$ICU.system
[1] FALSE
$ICU.UTF8
[1] FALSE
>
Do the last couple of lines indicate anything?
Sorry for the confusion. My bad for the miscoding. The problem remains, though. Try this:
library(dplyr)
library(rvest)
library(stringi)
#>
link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"
tx_xi <- read_html(link_speech) %>%
html_nodes("p") %>%
html_text
tx_xi[6]
#> [1] "同志们,朋友们:"
stri_detect_regex(tx_xi[6], "同志们") #Note that these are the very first three characters of the speech
#> [1] FALSE
#>
I think the problem is due to:
[2] LC_CTYPE=Chinese (Simplified)_China.936
...
system code page: 65001
ICU thinks your native encoding is UTF-8, whereas it's probably GBK.
Could you give stri_enc_set("Windows-936") a try?
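To illustrate the diagnosis, here is a minimal base R sketch. It rebuilds the GBK bytes that charToRaw() reported earlier in this thread, checks that they are not valid UTF-8, and then converts them with the source encoding declared explicitly. The variable names are only illustrative, and the sketch assumes the platform's iconv knows the "GBK" name.

```r
# GBK (code page 936) bytes of the search string and the pattern,
# copied from the charToRaw() output reported in this thread.
gbk_text    <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))
gbk_pattern <- rawToChar(as.raw(c(0xcf, 0xd8)))

# Read as UTF-8, neither byte sequence is well formed.
is_utf8 <- validUTF8(c(gbk_text, gbk_pattern))
is_utf8
#> [1] FALSE FALSE

# Declaring the source encoding yields the expected UTF-8 bytes.
utf8_text <- iconv(gbk_text, from = "GBK", to = "UTF-8")
charToRaw(utf8_text)
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
```

If the native encoding is wrongly assumed to be UTF-8, any function that trusts that assumption receives the invalid bytes in the first half of the sketch rather than the converted ones in the second half.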
My, it works! It looks like the error is indeed attributable to how ICU detects the encoding. Once Windows-936 is set, both of the above cases work well! Thank you so much, Marek, for helping me with this issue! I'm not sure whether this only affects Chinese text on a PC, but I bet many text analysts would appreciate knowing about this issue and the solution above!
Great, I changed the title of the issue so that it's more searchable.
To sum up, the solution was:
stri_enc_set("Windows-936")
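Because the setting has to be applied again in every session, one possible way to automate it is an attach hook placed in a startup file such as .Rprofile. This is only a sketch: needs_gbk_default() is a hypothetical helper, the check for "936" in LC_CTYPE is an assumption about how the affected locale is reported, and the hook has not been tested on the system discussed in this thread. setHook() and packageEvent() are base R functions.

```r
# Hypothetical helper: TRUE when R runs on Windows and LC_CTYPE
# reports code page 936 (an assumption about the affected setup).
needs_gbk_default <- function() {
  .Platform$OS.type == "windows" &&
    grepl("936", Sys.getlocale("LC_CTYPE"), fixed = TRUE)
}

if (needs_gbk_default()) {
  # Call stri_enc_set() each time stringi is attached via library().
  setHook(
    packageEvent("stringi", "attach"),
    function(...) stringi::stri_enc_set("Windows-936")
  )
}
```

The hook only fires when stringi is attached, so code that calls stringi::stri_detect_regex() without library(stringi) would still need an explicit stri_enc_set() call.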
A quick follow-up question: is there any trade-off in changing the stringi encoding? Or is there a way to let stringi recognize Chinese characters in UTF-8 as UTF-8? The encoding converter doesn't seem to make any difference without stri_enc_set:
# No stri_enc_set call is made
stri_detect_regex(stri_conv("昌平区", to = "UTF8"), stri_conv("县", to = "UTF8"))
#> [1] TRUE
# The correct outcome should be false, since the "县" isn't in "昌平区"
I get FALSE. I think the problem might as well be on your system side, not just stringi, but it's worth digging into it.
Can you call:
- charToRaw(stri_conv("昌平区", to = "UTF8"))
- charToRaw(stri_conv("县", to = "UTF8"))
- charToRaw("昌平区")
- charToRaw("县")
- stri_enc_mark("昌平区")
- stri_enc_mark("县")
Also, try iconv instead of stri_conv.
Also, maybe the most recent R (UCRT) is worth giving a try? https://github.com/r-windows/docs/blob/master/ucrt.md
iconv works. The PC system is definitely a major part of the cause of this issue. Nevertheless, I guess my situation is representative of the system environment of most R users in China. In that case, either stri_enc_set or iconv would work. Of course, if stringi could offer an argument to do so automatically, it would be great, ha-ha!
Regarding UCRT, it is definitely intriguing, but it looks like it is only about writing packages? I didn't see any instructions on how to make Windows automatically convert everything to UTF-8 at the input stage. If there is no such way, UCRT won't be that different from manually converting to UTF-8 with iconv, no?
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] ef bf bd ef bf bd
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"
stri_detect_regex(iconv("昌平区", to = "UTF8"), "县") # supposed to be FALSE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), "县") # supposed to be TRUE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), iconv("县", to = "UTF8")) # supposed to be TRUE
#> [1] TRUE
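The results above suggest that both the search string and the pattern need to be converted before matching. A minimal sketch of that idea, assuming the input bytes are known to be GBK: gbk_to_utf8() is a hypothetical helper, and the strings are built from raw bytes so the example does not depend on the session's native encoding. It relies on stri_encode() accepting an explicit from encoding, so the default set by stri_enc_set() is not involved.

```r
library(stringi)

# GBK (code page 936) bytes, built explicitly for reproducibility.
gbk_district <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8)))  # "昌平区"
gbk_county   <- rawToChar(as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xcf, 0xd8)))  # "昌平县"
gbk_pattern  <- rawToChar(as.raw(c(0xcf, 0xd8)))                          # "县"

# Hypothetical helper: convert with the source encoding stated explicitly.
gbk_to_utf8 <- function(x) stri_encode(x, from = "GBK", to = "UTF-8")

in_district <- stri_detect_regex(gbk_to_utf8(gbk_district), gbk_to_utf8(gbk_pattern))
in_county   <- stri_detect_regex(gbk_to_utf8(gbk_county), gbk_to_utf8(gbk_pattern))

in_district
#> [1] FALSE
in_county
#> [1] TRUE
```

Converting only one of the two arguments leaves the other as raw GBK bytes, which may explain why the mixed calls above gave inconsistent answers.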
Hmmm... are these really generated with stri_enc_set("Windows-936") in place? This needs to be called each time the package is loaded.
The byte sequence ef bf bd denotes the replacement character ("unknown"), btw.
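A short base R sketch of that remark: it shows that U+FFFD encodes to ef bf bd in UTF-8 and rebuilds the 14-byte output reported above for "昌平区". The reading that each invalid GBK byte became one replacement character, while the pair c6 bd survived because it happens to be valid UTF-8, is an interpretation of the output and is not confirmed in the thread.

```r
# U+FFFD REPLACEMENT CHARACTER encoded as UTF-8.
replacement_bytes <- charToRaw(intToUtf8(0xFFFD))
replacement_bytes
#> [1] ef bf bd

# c6 bd on its own is a well-formed two-byte UTF-8 sequence (U+01BD).
middle_code_point <- utf8ToInt(rawToChar(as.raw(c(0xc6, 0xbd))))
middle_code_point
#> [1] 445

# Assumed mapping: b2 and fd are replaced, c6 bd is kept, c7 and f8 are replaced.
rebuilt <- c(rep(replacement_bytes, 2), as.raw(c(0xc6, 0xbd)), rep(replacement_bytes, 2))
rebuilt
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
```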
Oh, I may have misled you! The above outputs were produced without calling stri_enc_set. As asked in https://github.com/gagolews/stringi/issues/448#issuecomment-886289072, I was looking for a solution that doesn't require calling stri_enc_set every time. Everything works fine when the encoding is manually set:
library(stringi)
stri_enc_set("Windows-936")
#> New settings: stringi_1.7.3 (en_US.GBK; ICU4C 69.1 [bundle]; Unicode 13.0)
#> Warning message:
#> In stri_info(short = TRUE) :
#> Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] e5 8e bf
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"
:)
Dear all, has anyone working in this locale experienced similar issues?