feather
[R] Encoding info is lost for non-ASCII column names
If the column names contain non-ASCII strings, their encoding marks (as reported by Encoding()) are lost when the data is read back from a local feather file. The following example was run on my Mac.
It is even worse on Windows: feather appears to treat the column names as having unknown encoding and convert them to the native encoding, which produces garbage column names that can never be converted back.
Minimal Reproducible Example
utf8_strings <- c("çile", "façile", "El. paÅ¡tas", "¡tas", "Þ")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
tbl <- data.frame(utf8_strings, latin1_strings, stringsAsFactors = FALSE)
colnames(tbl) <- c(utf8_strings[2], latin1_strings[2])
tbl2 <- local({
tmp_file <- tempfile(fileext = ".feather")
on.exit(unlink(tmp_file), add = TRUE)
feather::write_feather(tbl, tmp_file)
feather::read_feather(tmp_file)
})
colnames(tbl)
#> [1] "façile" "façile"
colnames(tbl2)
#> [1] "façile" "fa\xe7ile" ############SEE HERE############
Encoding(colnames(tbl))
#> [1] "UTF-8" "latin1"
Encoding(colnames(tbl2))
#> [1] "unknown" "unknown"
Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
colnames(tbl2)
#> [1] "façile" "façile"
sessionInfo()
#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.16 digest_0.6.15 rprojroot_1.3-2 backports_1.1.2
#> [5] formatR_1.5 magrittr_1.5 evaluate_0.10.1 pillar_1.2.1
#> [9] rlang_0.2.0 stringi_1.1.7 rmarkdown_1.9 tools_3.4.3
#> [13] stringr_1.3.0 feather_0.3.1 hms_0.4.2 yaml_2.1.18
#> [17] compiler_3.4.3 pkgconfig_2.0.1 htmltools_0.3.6 knitr_1.20
#> [21] tibble_1.4.2
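The Mac output above suggests that the bytes of the names survive the round trip and only the encoding mark is dropped. The base R sketch below illustrates that distinction without feather. It is not feather's actual code path, and `strip_mark` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: drop the encoding mark but leave the bytes untouched.
strip_mark <- function(x) {
  Encoding(x) <- "unknown"
  x
}

utf8_name <- "fa\u00e7ile"                                      # marked UTF-8
latin1_name <- iconv(utf8_name, from = "UTF-8", to = "latin1")  # marked latin1

stripped <- strip_mark(c(utf8_name, latin1_name))

# The marks are gone ...
Encoding(stripped)

# ... but the underlying bytes are the same as before.
identical(charToRaw(stripped[1]), charToRaw(utf8_name))
identical(charToRaw(stripped[2]), charToRaw(latin1_name))

# Re-declaring the marks restores the original strings.
restored <- stripped
Encoding(restored) <- c("UTF-8", "latin1")
identical(restored, c(utf8_name, latin1_name))
```

This matches what `Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")` does in the example: it only relabels bytes that are still intact.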
On Windows, the output becomes:
> colnames(tbl)
[1] "façile" "fa<e7>ile"
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"
> Encoding(colnames(tbl))
[1] "UTF-8" "latin1"
> Encoding(colnames(tbl2))
[1] "unknown" "unknown"
> Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile" ######NOTICE THE FIRST ONE###########
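In the meantime, one possible workaround is to convert the column names to UTF-8 before writing and re-declare them as UTF-8 after reading. This is only a sketch. It assumes the name bytes are written unchanged, as the Mac output suggests, so it would probably not help on Windows, where the output above shows the first name already altered beyond what re-marking can repair. `names_to_utf8` and `mark_names_utf8` are hypothetical helpers, not part of feather, and the loss of marks is simulated here in base R rather than produced by a real round trip.

```r
# Hypothetical helper: convert all column names to UTF-8 before writing.
names_to_utf8 <- function(df) {
  names(df) <- enc2utf8(names(df))
  df
}

# Hypothetical helper: declare that the column names read back are UTF-8.
mark_names_utf8 <- function(df) {
  nms <- names(df)
  Encoding(nms) <- "UTF-8"
  names(df) <- nms
  df
}

tbl <- data.frame(a = 1:2, b = 3:4)
names(tbl) <- c("fa\u00e7ile", iconv("fa\u00e7ile", from = "UTF-8", to = "latin1"))

prepared <- names_to_utf8(tbl)
Encoding(names(prepared))  # both names are now marked UTF-8

# Simulate the reader dropping the marks (assumption: bytes are kept as-is).
read_back <- prepared
nms <- names(read_back)
Encoding(nms) <- "unknown"
names(read_back) <- nms

fixed <- mark_names_utf8(read_back)
identical(names(fixed), names(prepared))
```

With feather itself, the intended usage would be `feather::write_feather(names_to_utf8(tbl), tmp_file)` followed by `mark_names_utf8(feather::read_feather(tmp_file))`.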
Does this issue persist in the arrow library?