readtext doesn't perform Unicode normalization

Open adamobeng opened this issue 9 years ago • 4 comments

see https://github.com/kbenoit/quanteda/issues/189

adamobeng avatar Nov 02 '16 14:11 adamobeng

See the stuff I started in https://github.com/kbenoit/quanteda/tree/dev_unicodeNorm for addressing this issue.

kbenoit avatar Nov 02 '16 20:11 kbenoit

Here's the only function from that branch, which you can work into readtext (and now I can delete the branch from quanteda):

## internal function to perform unicode normalization
## called from other functions as quanteda:::unicodeNorm(x)
##
unicodeNorm <- function(x, type = c("nfc", "nfd", "nfkd", "nfkc", "nfkc_casefold")) {
    if (!is.character(x)) stop("input must be character")
    type <- match.arg(type)

    switch(type,
           nfc = stringi::stri_trans_nfc(x),
           nfd = stringi::stri_trans_nfd(x),
           nfkd = stringi::stri_trans_nfkd(x),
           nfkc = stringi::stri_trans_nfkc(x),
           nfkc_casefold = stringi::stri_trans_nfkc_casefold(x))
} 
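
For context, here is a minimal sketch of why normalization matters when reading text. It calls stringi directly rather than the function above, and the example strings are purely illustrative: the same visible character can arrive as one precomposed code point or as a base letter plus a combining mark, and the two do not compare as identical until they are normalized to the same form.

```r
library(stringi)

# Two encodings of "é" (illustrative inputs, not taken from any real file):
composed   <- "\u00e9"    # single precomposed code point U+00E9
decomposed <- "e\u0301"   # "e" followed by combining acute accent U+0301

# The strings render the same but differ in code points
stri_length(composed)     # number of code points in the precomposed form
stri_length(decomposed)   # number of code points in the decomposed form
identical(composed, decomposed)

# After normalizing both to the same form, they match
nfc_equal <- identical(stri_trans_nfc(decomposed), stri_trans_nfc(composed))
nfd_equal <- identical(stri_trans_nfd(decomposed), stri_trans_nfd(composed))
```

Which form readtext should default to is a design choice not settled here; NFC is the first option in the function above, so it would be the default there.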

kbenoit avatar Nov 05 '16 14:11 kbenoit

Implemented in PR #52

adamobeng avatar Jan 09 '17 18:01 adamobeng

Putting this issue aside for now; see branch feature/unicode_normalisation.

kbenoit avatar May 17 '17 05:05 kbenoit