[R] Encoding info is lost for non-ASCII column names

Open shrektan opened this issue 7 years ago • 1 comments

If column names contain non-ASCII characters, their encoding marks (as reported by Encoding()) are lost when the data is read back from a local feather file. The example below was run on my Mac.

It is even worse on Windows: feather appears to convert the column names from the unknown encoding to the native encoding, which produces garbage column names that can never be converted back.
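To make the mechanism concrete: R keeps a declared encoding mark alongside the bytes of each string, and that mark is what goes missing in the round trip. The base R sketch below involves no feather at all and its variable names are illustrative only. It shows that the same text can be stored as different bytes under different marks, and that dropping the mark leaves bytes R can no longer interpret on its own.

```r
# Build the same text in two encodings; \u escapes keep the script
# independent of the encoding of the source file itself.
x_utf8   <- "fa\u00e7ile"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

Encoding(c(x_utf8, x_latin1))   # declared marks: "UTF-8" and "latin1"
charToRaw(x_utf8)               # c-cedilla is two bytes (c3 a7) in UTF-8
charToRaw(x_latin1)             # c-cedilla is one byte (e7) in latin1
x_utf8 == x_latin1              # TRUE: R translates before comparing

# Simulate what the report describes: same bytes, mark dropped.
x_lost <- x_latin1
Encoding(x_lost) <- "unknown"
identical(charToRaw(x_lost), charToRaw(x_latin1))  # TRUE: bytes unchanged
validUTF8(x_lost)               # FALSE: a lone e7 byte is not valid UTF-8

# Re-declaring the correct mark recovers the text without touching the bytes.
Encoding(x_lost) <- "latin1"
x_lost == x_utf8                # TRUE again
```

Recovery by re-declaring only works when the reader knows which encoding the bytes were written in, which is exactly the information that is lost here.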

Minimal Reproducible Example

utf8_strings <- c("çile", "façile", "El. paÅ¡tas", "¡tas", "Þ")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
tbl <- data.frame(utf8_strings, latin1_strings, stringsAsFactors = FALSE)
colnames(tbl) <- c(utf8_strings[2], latin1_strings[2])
tbl2 <- local({
  tmp_file <- tempfile(fileext = ".feather")
  on.exit(unlink(tmp_file), add = TRUE)
  feather::write_feather(tbl, tmp_file)
  feather::read_feather(tmp_file)
})
colnames(tbl)
#> [1] "façile" "façile"
colnames(tbl2)
#> [1] "façile"    "fa\xe7ile" ############SEE HERE############
Encoding(colnames(tbl))
#> [1] "UTF-8"  "latin1"
Encoding(colnames(tbl2))
#> [1] "unknown" "unknown"
Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
colnames(tbl2)
#> [1] "façile" "façile"

sessionInfo()

#> R version 3.4.3 (2017-11-30)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.16    digest_0.6.15   rprojroot_1.3-2 backports_1.1.2
#>  [5] formatR_1.5     magrittr_1.5    evaluate_0.10.1 pillar_1.2.1   
#>  [9] rlang_0.2.0     stringi_1.1.7   rmarkdown_1.9   tools_3.4.3    
#> [13] stringr_1.3.0   feather_0.3.1   hms_0.4.2       yaml_2.1.18    
#> [17] compiler_3.4.3  pkgconfig_2.0.1 htmltools_0.3.6 knitr_1.20     
#> [21] tibble_1.4.2

On Windows the output becomes:

> colnames(tbl)
[1] "façile"    "fa<e7>ile"
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile"    
> Encoding(colnames(tbl))
[1] "UTF-8"  "latin1"
> Encoding(colnames(tbl2))
[1] "unknown" "unknown"
> Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
> colnames(tbl2)
[1] "fa<U+00E7>ile" "fa<e7>ile" ######NOTICE THE FIRST ONE###########
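For the macOS case above, where the bytes come back intact and only the marks are missing, one possible workaround is to normalise the column names to UTF-8 before writing and re-declare them as UTF-8 after reading. The two helpers below are hypothetical and not part of feather. To keep the block runnable without the package, the write/read round trip is simulated by dropping the marks by hand, which is an assumption about what happens on that platform. This would not undo the Windows result, where the names have already been converted by the time they reach R.

```r
# Hypothetical helpers (not part of feather): keep column names in UTF-8
# across a round trip that drops encoding marks.
names_to_utf8 <- function(df) {
  # enc2utf8() converts marked latin1 (or native) strings to UTF-8 bytes.
  names(df) <- enc2utf8(names(df))
  df
}

mark_names_utf8 <- function(df) {
  nms <- names(df)
  # Only re-declare names whose bytes really are valid UTF-8.
  ok <- validUTF8(nms)
  Encoding(nms[ok]) <- "UTF-8"
  names(df) <- nms
  df
}

# Example data: one UTF-8 name and one latin1 name.
tbl <- data.frame(a = 1:2, b = 3:4)
names(tbl) <- c("fa\u00e7ile", iconv("\u00de", from = "UTF-8", to = "latin1"))

out <- names_to_utf8(tbl)
Encoding(names(out))            # both names are marked "UTF-8" now

# Stand-in for write_feather()/read_feather(): drop the marks, keep the bytes.
lost <- out
nms <- names(lost)
Encoding(nms) <- "unknown"
names(lost) <- nms

restored <- mark_names_utf8(lost)
identical(names(restored), names(out))   # TRUE
```

Normalising before writing matters because latin1 bytes that lose their mark cannot be told apart from other single-byte encodings, whereas UTF-8 bytes can at least be validated.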

shrektan, Apr 09 '18 13:04

Does this issue persist in the arrow library?

wesm, Apr 10 '20 01:04
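One way to answer that question would be to rerun the column-name check with arrow's Feather functions. The sketch below assumes the arrow package is installed and uses arrow::write_feather() and arrow::read_feather(); the wrapper name check_arrow_colnames is made up for illustration. It only reports the encoding marks before and after the round trip and makes no claim about what the result will be.

```r
# Sketch: repeat the column-name round trip with arrow instead of feather.
# Returns NULL when the 'arrow' package is not installed.
check_arrow_colnames <- function() {
  if (!requireNamespace("arrow", quietly = TRUE)) return(NULL)

  tbl <- data.frame(x = 1:2, y = 3:4)
  # One UTF-8 name and one latin1 name, built from escapes.
  names(tbl) <- c("fa\u00e7ile", iconv("\u00de", from = "UTF-8", to = "latin1"))

  tmp_file <- tempfile(fileext = ".feather")
  on.exit(unlink(tmp_file), add = TRUE)
  arrow::write_feather(tbl, tmp_file)
  tbl2 <- arrow::read_feather(tmp_file)

  # Report the marks on both sides so they can be compared by eye.
  list(
    before = Encoding(names(tbl)),
    after  = Encoding(names(tbl2)),
    names_after = names(tbl2)
  )
}

check_arrow_colnames()
```

Running it on both macOS and Windows would cover the two cases reported above.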