Add encoding inference function

Open koheiw opened this issue 6 years ago • 0 comments

The EU manifesto example is incorrect because not every file is in ISO-8859-1; the Hungarian text, for example, is not. https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files

However, specifying the encoding manually is tedious. Why not do something like this? stri_enc_detect() makes good guesses.

path_data <- system.file("extdata/", package = "readtext")

for (f in list.files(paste0(path_data, "/txt/EU_manifestos/"), full.names = TRUE)) {
  print(f)
  enc <- stringi::stri_enc_detect(readBin(file(f, 'rb'), character()))
  print(enc[[1]][1:2,])
}
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       de       0.80
2 ISO-8859-9       tr       0.24
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       de       0.83
2 ISO-8859-9       tr       0.26
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       en       0.75
2 ISO-8859-2       ro       0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       en       0.75
2 ISO-8859-2       ro       0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       es       0.91
2 ISO-8859-2       ro       0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       es       0.88
2 ISO-8859-2       ro       0.36
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fi_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       sv       0.20
2 ISO-8859-9       tr       0.17
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       fr       0.94
2 ISO-8859-2       ro       0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_V.txt"
    Encoding Language Confidence
1 ISO-8859-1       fr       0.92
2 ISO-8859-2       ro       0.37
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_gr_V.txt"
    Encoding Language Confidence
1 ISO-8859-7       el       0.74
2   UTF-16BE                0.10
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_hu_V.txt"
    Encoding Language Confidence
1 ISO-8859-2       hu       0.53
2 ISO-8859-1       en       0.16
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_it_PSE.txt"
    Encoding Language Confidence
1 ISO-8859-1       it       0.83
2 ISO-8859-2       ro       0.43
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_lv_V.txt"
Error in enc[[1]] : subscript out of bounds
In addition: There were 13 warnings (use warnings() to see them)
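
The error on the Latvian file suggests that `stri_enc_detect()` returned an empty list for it, presumably because `readBin(..., character())` did not return any string for that file. The warnings are probably about connections opened with `file()` that were never closed. The following is a minimal sketch of one way to avoid both, assuming it is acceptable to read each file as raw bytes and pass them to `stri_enc_detect()`. The helper name `detect_file_encoding()` and its `n_guesses` argument are hypothetical and not part of readtext.

```r
# Hypothetical helper (not part of readtext): guess the encoding of a single
# file from its raw bytes and return the top candidates.
detect_file_encoding <- function(path, n_guesses = 2) {
  # Given a file name, readBin() opens and closes the connection itself,
  # so no connection is left open.
  bytes <- readBin(path, what = "raw", n = file.size(path))

  # Return a placeholder row for empty files instead of failing.
  if (length(bytes) == 0L) {
    return(data.frame(Encoding = NA_character_,
                      Language = NA_character_,
                      Confidence = NA_real_))
  }

  # For a raw vector, stri_enc_detect() returns a list with one element
  # holding the candidate encodings.
  guesses <- stringi::stri_enc_detect(bytes)[[1]]
  head(guesses, n_guesses)
}

# Same files as in the loop above; if readtext is not installed,
# system.file() returns "" and the file list is simply empty.
path_data <- system.file("extdata", package = "readtext")
files <- list.files(file.path(path_data, "txt", "EU_manifestos"),
                    full.names = TRUE)

results <- lapply(files, detect_file_encoding)
names(results) <- basename(files)
print(results)
```

The first row of each result could then serve as the default for the `encoding` argument of `readtext()`, with a manually specified encoding taking precedence. Low-confidence guesses, such as the one for the Finnish file above, would probably still need a manual check.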

koheiw, Jul 26 '19 16:07