Could "Any-Latin; Latin-ASCII" be added to replace_non_ascii() to address logographics/cyrillic/devanagari?

Open dustinstoltz opened this issue 3 years ago • 0 comments

I see that replace_non_ascii() uses stringi::stri_trans_general(x, "latin-ascii")

This doesn't work for non-Latin scripts such as Japanese katakana, Korean Hangul, Cyrillic, or Devanagari; those strings come back unchanged:

library(stringi)
x <- c("キャンパス", "재미", "wylądować", "Дорога", "heiß", "Raül", "brûlée", "भोजन")
Encoding(x) <- "UTF-8"
stri_trans_general(x, id = "Latin-ASCII")
[1] "キャンパス" "재미"       "wyladowac"  "Дорога"     "heiss"     
[6] "Raul"       "brulee"     "भोजन" 

The function could first transliterate to Any-Latin and then to Latin-ASCII, which seems a safer default:

stri_trans_general(x, id = "Any-Latin; Latin-ASCII")
[1] "kyanpasu"  "jaemi"     "wyladowac" "Doroga"    "heiss"     "Raul"     
[7] "brulee"    "bhojana"
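
For illustration, here is a minimal sketch of how the two-step transform might be wrapped. The helper name `to_ascii` is hypothetical and is not part of textclean; only `stri_trans_general()` and `stri_enc_isascii()` from stringi are real calls. The final check is an assumption about how one might flag strings that still contain non-ASCII characters after transliteration.

```r
library(stringi)

# Hypothetical helper (not a textclean function): transliterate any script
# to Latin first, then fold the Latin result down to plain ASCII.
to_ascii <- function(x, id = "Any-Latin; Latin-ASCII") {
  stringi::stri_trans_general(x, id = id)
}

x <- c("Дорога", "heiß", "भोजन")
out <- to_ascii(x)

# TRUE for every element that still contains non-ASCII characters.
still_non_ascii <- !stringi::stri_enc_isascii(out)
```

Keeping the transform ID as an argument would let callers fall back to plain "Latin-ASCII" if they want non-Latin scripts left untouched.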

Just a thought -- love the package!

dustinstoltz, Jun 17 '22 13:06