Problem with auto-detecting the Windows-936 (GBK, simplified Chinese) encoding

Open sammo3182 opened this issue 2 years ago • 15 comments

stri_detect_regex does not seem to recognize Chinese characters correctly when they are used as a regex pattern. I'm using the 1.4.0.9000 dev version on R 4.1.0. Here's an example:

Sys.setlocale(, "Chinese")
library(stringi)

stri_detect_fixed("昌平区", "县") # Works fine
#> [1] FALSE
stri_detect_regex("昌平区", "县") # Unexpectedly returns TRUE
#> [1] TRUE
grepl("县", "昌平区") # FALSE
#> [1] FALSE

Another example:

library(dplyr)
library(rvest)
library(stringi)

link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text()

stri_detect_regex(tx_xi, "同志们")  #Note that these are the very first three characters of the speech

#> [1] FALSE

sessionInfo()
#> R Under development (unstable) (2021-05-17 r80314)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#>  [1] LC_COLLATE=Chinese (Simplified)_China.936 
#> [2] LC_CTYPE=Chinese (Simplified)_China.936   
#> [3] LC_MONETARY=Chinese (Simplified)_China.936
#> [4] LC_NUMERIC=C                              
#> [5] LC_TIME=Chinese (Simplified)_China.936    
#> system code page: 65001
#>
#> attached base packages:
#>  [1] stats     graphics  grDevices utils     datasets  methods  
#> [7] base     
#>
#> other attached packages:
#>   [1] stringi_1.7.3
#>
#> loaded via a namespace (and not attached):
#>   [1] compiler_4.2.0 tools_4.2.0    parallel_4.2.0

The issue was submitted to stringr (https://github.com/tidyverse/stringr/issues/386#issue-894992244), but it looks like a stringi problem?

Jul 24 '21 00:07 sammo3182

I cannot reproduce the above; I get:

>  library("stringi")
> stri_detect_regex("昌平区", "县")
[1] FALSE
> stri_detect_fixed("昌平区", "县")
[1] FALSE
> grepl("县", "昌平区") 
[1] FALSE
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.7.3

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0   
>

What does stri_escape_unicode() return on your platform when run on both strings (pattern, search string)? How about charToRaw()? How about utf8ToInt()?
Can you try with a more recent version of the stringi package?
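
For comparison, here is a minimal base-R sketch of what these diagnostics look like when the strings really are UTF-8. The characters are written as \u escapes (an assumption made for this sketch, so that the result does not depend on how the script file itself is encoded); stri_escape_unicode() is left out so that the snippet runs without stringi.

```r
# "\u53bf" is 县 and "\u660c\u5e73\u533a" is 昌平区, written as Unicode escapes
pattern <- "\u53bf"
text <- "\u660c\u5e73\u533a"

charToRaw(pattern)            # e5 8e bf
charToRaw(text)               # e6 98 8c e5 b9 b3 e5 8c ba
utf8ToInt(pattern)            # 21439, i.e. U+53BF
validUTF8(c(pattern, text))   # TRUE TRUE
```

If charToRaw() shows two bytes per Chinese character instead of three, or utf8ToInt() returns NA, the string is most likely in a legacy encoding such as GBK rather than UTF-8.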

Jul 24 '21 00:07 gagolews

Also, could you please show me the result of a call to stri_info(FALSE)?

Jul 24 '21 00:07 gagolews

With the latter, I get:

stri_detect_regex(tx_xi, "同志们") 
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[69] FALSE FALSE
> tx_xi[1]
[1] "在庆祝中国共产党成立100周年大会上的讲话"

Jul 24 '21 00:07 gagolews

Marek, first, thank you so much for helping me with this!! One reason you couldn't reproduce my result may be that you didn't switch the locale to Chinese with Sys.setlocale, as I did in the first line of the example. It's important; without it, much of the Chinese output would just be returned as hex Unicode or UTF-8 codes. (Yihui has talked about this in many places.)

Per your questions, here is what I got:

> stri_escape_unicode("昌平区")
Error in stri_escape_unicode("昌平区") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode("县")
Error in stri_escape_unicode("县") : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> # Following the error message, I did the following
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> ?stri_enc_toutf8
> # Following the error message, I did the following
> stri_enc_toutf8("昌平区")
[1] "昌平区"
> stri_enc_toutf8("县")
[1] "县"
> 
> stri_escape_unicode(stri_enc_toutf8("昌平区"))
Error in stri_escape_unicode(stri_enc_toutf8("昌平区")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> stri_escape_unicode(stri_enc_toutf8("县"))
Error in stri_escape_unicode(stri_enc_toutf8("县")) : 
  invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()
> 
> 
> charToRaw("昌平区")
[1] b2 fd c6 bd c7 f8
> charToRaw("县")
[1] cf d8
> 
> utf8ToInt("昌平区")
[1] NA
> utf8ToInt("县")
[1] NA

> stri_info(FALSE)
$Unicode.version
[1] "13.0"

$ICU.version
[1] "69.1"

$Locale
$Locale$Language
[1] "en"

$Locale$Country
[1] "US"

$Locale$Variant
[1] ""

$Locale$Name
[1] "en_US"


$Charset.internal
[1] "UTF-8"  "UTF-16"

$Charset.native
$Charset.native$Name.friendly
[1] "UTF-8"

$Charset.native$Name.ICU
[1] "UTF-8"

$Charset.native$Name.UTR22
[1] NA

$Charset.native$Name.IBM
[1] "ibm-1208"

$Charset.native$Name.WINDOWS
[1] "windows-65001"

$Charset.native$Name.JAVA
[1] "UTF-8"

$Charset.native$Name.IANA
[1] "UTF-8"

$Charset.native$Name.MIME
[1] "UTF-8"

$Charset.native$ASCII.subset
[1] TRUE

$Charset.native$Unicode.1to1
[1] NA

$Charset.native$CharSize.8bit
[1] FALSE

$Charset.native$CharSize.min
[1] 1

$Charset.native$CharSize.max
[1] 3


$ICU.system
[1] FALSE

$ICU.UTF8
[1] FALSE

>

Do the last couple of lines indicate anything?

Jul 24 '21 04:07 sammo3182

Sorry for the confusion. My bad for the miscoding. The problem remains, though. Try this:

library(dplyr)
library(rvest)
library(stringi)
link_speech <- "http://www.xinhuanet.com/politics/2021-07/15/c_1127658385.htm"

tx_xi <- read_html(link_speech) %>%
  html_nodes("p") %>%
  html_text()

tx_xi[6]
#> [1] "同志们，朋友们："
stri_detect_regex(tx_xi[6], "同志们")  #Note that these are the very first three characters of the speech
#> [1] FALSE
#>

Jul 24 '21 04:07 sammo3182

I think the problem is due to:

[2] LC_CTYPE=Chinese (Simplified)_China.936   
...
system code page: 65001

ICU thinks your native encoding is UTF-8, whereas it's probably GBK.

Could you give stri_enc_set("Windows-936") a try?
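
To illustrate the mismatch, here is a minimal base-R sketch that takes the bytes reported by charToRaw("昌平区") earlier in this thread. They are not valid UTF-8, but they decode cleanly once the source encoding is named explicitly. It assumes the platform's iconv knows the "GBK" encoding name, which may not hold everywhere.

```r
# Bytes reported by charToRaw("昌平区") on the affected Windows session
gbk_bytes <- as.raw(c(0xb2, 0xfd, 0xc6, 0xbd, 0xc7, 0xf8))
x <- rawToChar(gbk_bytes)

# Read as UTF-8 these bytes are malformed (0xb2 cannot start a sequence)
validUTF8(x)          # FALSE

# Naming the real source encoding gives the expected UTF-8 bytes
utf8 <- iconv(x, from = "GBK", to = "UTF-8")
charToRaw(utf8)       # e6 98 8c e5 b9 b3 e5 8c ba
```

Any function that assumes the native encoding is UTF-8 will see the first form and either fail or substitute replacement characters.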

Jul 24 '21 07:07 gagolews

My, it works! It looks like the error is indeed caused by ICU's detection of the native encoding. Once Windows-936 is set, both of the above cases work well! Thank you so much, Marek, for helping me with this issue! I'm not sure if this only affects Chinese text on Windows PCs, but I bet many text analysts would appreciate knowing about this issue and the solution above!

Jul 26 '21 00:07 sammo3182

Great, I changed the title of the issue so that it's more searchable.

To sum up, the solution was:

stri_enc_set("Windows-936")
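
Since the override is only wanted on affected machines, the call could be guarded. A rough sketch, not an official stringi feature: needs_gbk_override() is a hypothetical helper introduced here, and the check assumes that l10n_info()$codepage reports 936 and that stri_enc_get() returns "UTF-8" in the misdetected state, as the outputs in this thread suggest.

```r
# Hypothetical helper (not part of stringi): TRUE only when R runs on Windows
# in a code page 936 locale while stringi believes the native encoding is UTF-8.
needs_gbk_override <- function(os_type, codepage, stringi_native) {
  identical(os_type, "windows") &&
    identical(as.integer(codepage), 936L) &&
    identical(stringi_native, "UTF-8")
}

if (requireNamespace("stringi", quietly = TRUE)) {
  cp <- l10n_info()$codepage   # NULL on non-Windows platforms
  if (needs_gbk_override(.Platform$OS.type, cp, stringi::stri_enc_get())) {
    stringi::stri_enc_set("Windows-936")
  }
}
```

The guard is meant to run right after library(stringi) in each session; on Linux, macOS, or a UTF-8 Windows locale it does nothing.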

Jul 26 '21 00:07 gagolews

A quick follow-up question: is there any tradeoff in changing the stringi encoding? Or is there a way to let stringi recognize Chinese characters in UTF-8 as UTF-8? The encoding converter seems not to make any difference at all without stri_enc_set:

# stri_enc_set has not been called
stri_detect_regex(stri_conv("昌平区", to = "UTF8"), stri_conv("县", to = "UTF8")) 
#> [1] TRUE
# The correct outcome should be false, since the "县" isn't in "昌平区"

Jul 26 '21 00:07 sammo3182

I get FALSE. I think the problem might well be on your system's side, not just in stringi, but it's worth digging into.

Can you call:

charToRaw(stri_conv("昌平区", to = "UTF8"))
charToRaw(stri_conv("县", to = "UTF8"))
charToRaw("昌平区")
charToRaw("县")
stri_enc_mark("昌平区")
stri_enc_mark("县")

Also, try iconv instead of stri_conv

Jul 26 '21 00:07 gagolews

Also, maybe the most recent R - UCRT is worth giving a try? https://github.com/r-windows/docs/blob/master/ucrt.md

Jul 26 '21 00:07 gagolews

iconv works. The PC's system configuration is definitely a major cause of this issue. Nevertheless, I guess my situation is representative of the system environment of most R users in China. In that case, either stri_enc_set or iconv would work. Of course, if stringi could offer an argument to do so automatically, that would be great, ha-ha!

Regarding UCRT, it is definitely intriguing, but it looks like it is only about writing packages? I didn't see any instructions on how to make Windows automatically convert everything to UTF-8 at the input stage. If there is no such way, UCRT won't be that different from manually converting to UTF-8 with iconv, no?

charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] ef bf bd ef bf bd c6 bd ef bf bd ef bf bd
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] ef bf bd ef bf bd
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

stri_detect_regex(iconv("昌平区", to = "UTF8"), "县") # supposed to be FALSE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), "县") # supposed to be TRUE
#> [1] FALSE
stri_detect_regex(iconv("昌平县", to = "UTF8"), iconv("县", to = "UTF8")) # supposed to be TRUE
#> [1] TRUE

Jul 26 '21 04:07 sammo3182

Hmmm... are these really generated with stri_enc_set("Windows-936") in place? This needs to be called each time the package is loaded.

The byte sequence ef bf bd denotes the replacement character ("unknown") btw
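
This can be checked in base R. The sketch below encodes U+FFFD and then decodes the bytes from the stri_conv("昌平区", to = "UTF8") output above; nothing here depends on stringi.

```r
# U+FFFD REPLACEMENT CHARACTER encodes to the three bytes ef bf bd in UTF-8
replacement <- intToUtf8(0xFFFD)
charToRaw(replacement)   # ef bf bd

# Bytes returned by stri_conv("昌平区", to = "UTF8") without stri_enc_set
garbled <- as.raw(c(0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd, 0xc6, 0xbd,
                    0xef, 0xbf, 0xbd, 0xef, 0xbf, 0xbd))
code_points <- sprintf("U+%04X", utf8ToInt(rawToChar(garbled)))
code_points   # "U+FFFD" "U+FFFD" "U+01BD" "U+FFFD" "U+FFFD"
```

So four of the six GBK bytes (b2 fd ... c7 f8) were each replaced by U+FFFD, while c6 bd happened to form a valid two-byte UTF-8 sequence and survived as an unrelated character. That matches the GBK bytes having been read as if they were UTF-8.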

Jul 26 '21 04:07 gagolews

Oh, I may have misled you! The above outputs were produced without calling stri_enc_set. As asked in https://github.com/gagolews/stringi/issues/448#issuecomment-886289072, I was looking for a solution that does not require calling stri_enc_set again each time. Everything works fine when the encoding is set manually:

library(stringi)
stri_enc_set("Windows-936")
#> New settings: stringi_1.7.3 (en_US.GBK; ICU4C 69.1 [bundle]; Unicode 13.0)
#> Warning message:
#> In stri_info(short = TRUE) :
#>   Your native charset does not map to Unicode well. This may cause serious problems. Consider switching to UTF-8.
charToRaw(stri_conv("昌平区", to = "UTF8"))
#> [1] e6 98 8c e5 b9 b3 e5 8c ba
charToRaw(stri_conv("县", to = "UTF8"))
#> [1] e5 8e bf
charToRaw("昌平区")
#> [1] b2 fd c6 bd c7 f8
charToRaw("县")
#> [1] cf d8
stri_enc_mark("昌平区")
#> [1] "native"
stri_enc_mark("县")
#> [1] "native"

Jul 26 '21 04:07 sammo3182

Dear all, has anyone working in this locale experienced similar issues?

Jul 26 '21 05:07 gagolews
