readtext
readtext doesn't perform Unicode normalization
See https://github.com/kbenoit/quanteda/issues/189 for the original report.
See the stuff I started in https://github.com/kbenoit/quanteda/tree/dev_unicodeNorm for addressing this issue.
Here's the only function from that branch, which you can work into readtext (and now I can delete the branch from quanteda):
## internal function to perform unicode normalization
## called from other functions as quanteda:::unicodeNorm(x)
##
unicodeNorm <- function(x, type = c("nfc", "nfd", "nfkd", "nfkc", "nfkc_casefold")) {
    if (!is.character(x)) stop("input must be character")
    ## "nfkc" must be listed above, or match.arg() rejects it before switch() is reached
    type <- match.arg(type)
    switch(type,
           nfc = stringi::stri_trans_nfc(x),
           nfd = stringi::stri_trans_nfd(x),
           nfkd = stringi::stri_trans_nfkd(x),
           nfkc = stringi::stri_trans_nfkc(x),
           nfkc_casefold = stringi::stri_trans_nfkc_casefold(x))
}
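For context, here is a minimal sketch of why this matters for text read from files. It assumes the stringi package is installed, and the two example strings are made up for illustration. The same visible character can arrive in precomposed or decomposed form, and the two do not compare equal until they are normalized to a common form:

```r
library(stringi)

# Two encodings of the same visible character "é" (illustrative inputs):
composed   <- "\u00e9"    # precomposed: U+00E9
decomposed <- "e\u0301"   # "e" followed by combining acute accent U+0301

# Without normalization the strings differ, so tokens would not match.
composed == decomposed                     # FALSE

# After NFC normalization the decomposed form collapses to the precomposed one.
stri_trans_nfc(decomposed) == composed     # TRUE

# The code point counts show what changed.
stri_length(decomposed)                    # 2
stri_length(stri_trans_nfc(decomposed))    # 1
```

Applying one normalization form to all input at read time would keep such variants from being counted as different features downstream. Which form should be the default is a design choice left open here.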
Implemented in PR #52
Putting this issue aside for now; see branch feature/unicode_normalisation.