readtext
Add encoding inference function
The EU manifesto example in the vignette is incorrect, because not all of the files are in ISO-8859-1; the Hungarian text, for example, is not. https://readtext.quanteda.io/articles/readtext_vignette.html#reading-one-or-more-text-files
However, specifying the encoding manually for each file is tedious. Why not infer it automatically, like this? stri_enc_detect() makes good guesses.
path_data <- system.file("extdata/", package = "readtext")
for (f in list.files(paste0(path_data, "/txt/EU_manifestos/"), full.names = TRUE)) {
    print(f)
    enc <- stringi::stri_enc_detect(readBin(file(f, 'rb'), character()))
    print(enc[[1]][1:2,])
}
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_PSE.txt"
Encoding Language Confidence
1 ISO-8859-1 de 0.80
2 ISO-8859-9 tr 0.24
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_de_V.txt"
Encoding Language Confidence
1 ISO-8859-1 de 0.83
2 ISO-8859-9 tr 0.26
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_PSE.txt"
Encoding Language Confidence
1 ISO-8859-1 en 0.75
2 ISO-8859-2 ro 0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_en_V.txt"
Encoding Language Confidence
1 ISO-8859-1 en 0.75
2 ISO-8859-2 ro 0.21
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_PSE.txt"
Encoding Language Confidence
1 ISO-8859-1 es 0.91
2 ISO-8859-2 ro 0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_es_V.txt"
Encoding Language Confidence
1 ISO-8859-1 es 0.88
2 ISO-8859-2 ro 0.36
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fi_V.txt"
Encoding Language Confidence
1 ISO-8859-1 sv 0.20
2 ISO-8859-9 tr 0.17
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_PSE.txt"
Encoding Language Confidence
1 ISO-8859-1 fr 0.94
2 ISO-8859-2 ro 0.35
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_fr_V.txt"
Encoding Language Confidence
1 ISO-8859-1 fr 0.92
2 ISO-8859-2 ro 0.37
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_gr_V.txt"
Encoding Language Confidence
1 ISO-8859-7 el 0.74
2 UTF-16BE 0.10
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_hu_V.txt"
Encoding Language Confidence
1 ISO-8859-2 hu 0.53
2 ISO-8859-1 en 0.16
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_it_PSE.txt"
Encoding Language Confidence
1 ISO-8859-1 it 0.83
2 ISO-8859-2 ro 0.43
[1] "/home/kohei/R/x86_64-pc-linux-gnu-library/3.6/readtext/extdata//txt/EU_manifestos//EU_euro_2004_lv_V.txt"
Error in enc[[1]] : subscript out of bounds
In addition: There were 13 warnings (use warnings() to see them)
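The loop above stops at the Latvian file, where `enc` comes back empty. A possible way around this, offered only as a sketch, is to read each file as raw bytes and fall back to `NA` when no candidate is returned. `detect_encoding()` is a hypothetical helper name, not an existing readtext function, and whether reading raw bytes actually avoids the failure on that file is an assumption that has not been checked here.

```r
# Hypothetical helper (not part of readtext): guess a file's encoding
# from its raw bytes using stringi::stri_enc_detect().
detect_encoding <- function(path) {
    size <- file.size(path)
    # Nothing to detect in a missing or empty file
    if (is.na(size) || size == 0) return(NA_character_)
    # Passing a path (not a connection) lets readBin() open and close the file itself
    bytes <- readBin(path, what = "raw", n = size)
    guess <- stringi::stri_enc_detect(bytes)[[1]]
    # Fall back to NA when the detector returns no candidates
    if (is.null(guess) || NROW(guess) == 0) return(NA_character_)
    # Candidates are sorted by decreasing confidence, so take the first
    guess$Encoding[1]
}

path_data <- system.file("extdata", package = "readtext")
files <- list.files(file.path(path_data, "txt", "EU_manifestos"), full.names = TRUE)
# One guessed encoding (or NA) per file, named by file path
encodings <- vapply(files, detect_encoding, character(1))
print(encodings)
```

If the guesses look reasonable, the resulting vector could presumably be supplied to the `encoding` argument of `readtext()` in place of a manually written one, with low-confidence guesses (such as the Finnish file above) still checked by hand.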