
Consider setting "," as default decimal separator when column delimiter is ";"

Open sebastiansauer opened this issue 5 years ago • 8 comments

Please specify whether your issue is about:

  • [ ] a possible bug
  • [ ] a question about package functionality
  • [x] a suggested code or documentation change, improvement to the code, or feature request

I would like to use rio::import() as a wrapper around read_csv2(). The good thing is that import() guesses the delimiter (e.g., ";") correctly. However, it does not recognize the decimal separator ","; the comma is the decimal separator that read_csv2() uses.

Please consider using the comma "," as the default decimal separator when the column delimiter is ";".

Of course, it is quick to add the parameter by hand when I use import(); however, I work a lot with students who are being introduced to R. Getting the decimal and column separators right is a constant hindrance for them.

This feature would be an advantage over, e.g., read_csv2().

See this example:

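library(rio)
library(dplyr)  # for glimpse()

data("mtcars")

mtcars$hp[1] <- 100.1

# write.csv2() uses ";" as the column separator and "," as the decimal mark
write.csv2(mtcars, "mtcars2.csv")

m2_ohno <- rio::import("mtcars2.csv")
glimpse(m2_ohno)

str(m2_ohno$hp)  # character, not numeric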
chr [1:32] "100,1" "110" "93" "110" "175" "105" "245" "62" "
## session info for your system
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rio_0.5.10      forcats_0.3.0   stringr_1.3.1   dplyr_0.7.6     purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.4  reshape2_1.4.3    haven_1.1.2       lattice_0.20-35   colorspace_1.3-2  yaml_2.1.19       rlang_0.2.1       pillar_1.2.3      foreign_0.8-70    glue_1.2.0.9000   withr_2.1.2       modelr_0.1.2     
[13] readxl_1.1.0      bindrcpp_0.2.2    bindr_0.1.1       plyr_1.8.4        munsell_0.5.0     gtable_0.2.0      cellranger_1.1.0  zip_1.0.0         rvest_0.3.2       psych_1.8.4       knitr_1.20        parallel_3.5.0   
[25] curl_3.2          broom_0.4.5       Rcpp_0.12.17      clipr_0.4.1       scales_0.5.0      jsonlite_1.5      mnormt_1.5-5      hms_0.4.2         stringi_1.2.3     openxlsx_4.1.0    grid_3.5.0        cli_1.0.0        
[37] tools_3.5.0       magrittr_1.5      lazyeval_0.2.1    crayon_1.3.4      pkgconfig_2.0.1   data.table_1.11.4 xml2_1.2.0        datapasta_3.0.0   lubridate_1.7.4   assertthat_0.2.0  httr_1.3.1        rstudioapi_0.7   
[49] R6_2.2.2          nlme_3.1-137      compiler_3.5.0   

sebastiansauer avatar Jul 12 '18 11:07 sebastiansauer

This is already supported; you just need to specify format. The default is "US-style" CSV, but you can import "Euro-style" CSV with either of:

> str(rio::import("mtcars2.csv", format = "csv2")$hp)
 num [1:32] 100 110 93 110 175 ...
> str(rio::import("mtcars2.csv", format = ";")$hp)
 num [1:32] 100 110 93 110 175 ...

leeper avatar Jul 13 '18 07:07 leeper

Thanks for the clarification. Is there a way to rio::import a CSV without knowing in advance what its format is? Like: "Gimme a csv, don't tell me its format, I will import it correctly".

sebastiansauer avatar Jul 16 '18 08:07 sebastiansauer

@leeper can check me here, but:

That should be handled under the hood by data.table::fread(), which rio::import() calls for delimited files. The flat-file methods in this package end up passing sep = "auto" to fread(), regardless of the specific file extension, so fread() will try to determine the format automatically.

However, it looks like there's a stop() call if the file has no extension, so if you don't know the actual file format (i.e. delimiter), and the file has no extension, just give the file a "pretend" .csv extension, and import() will still determine the best format via fread(). Or, leave the extension off and pass format = "csv", to the same effect.
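
A minimal sketch of the extensionless case described above, assuming rio (and data.table, which it relies on for delimited files) is installed. The temporary file and the name `dat` are made up for illustration.

```r
library(rio)

# tempfile() without `fileext` yields a path that has no file extension
# (illustrative stand-in for a real extensionless file).
path <- tempfile()
write.csv(mtcars, path, row.names = FALSE)

# With no extension to go by, name the format explicitly.
dat <- import(path, format = "csv")
str(dat$hp)
```

This only covers a comma-separated file; it says nothing about how a semicolon-separated file with decimal commas would be treated.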

ryapric avatar Jul 16 '18 14:07 ryapric

Hi @ryapric, it appears that this auto-detection does not work. Consider the following example, where 100.1 (a numeric value) is saved in csv2 format. rio::import() does not recognize the decimal separator (",") correctly in this case.

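library(rio)
library(dplyr)  # for glimpse()

data("mtcars")

mtcars$hp[1] <- 100.1

# write.csv2() uses ";" as the column separator and "," as the decimal mark
write.csv2(mtcars, "mtcars2.csv")

m2_ohno <- rio::import("mtcars2.csv")
glimpse(m2_ohno)

str(m2_ohno$hp)  # character, not numeric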
chr [1:32] "100,1" "110" "93" "110" "175" "105" "245" "62" 

Let me know if I'm missing something. In any case, it's not essential.

sebastiansauer avatar Jul 16 '18 15:07 sebastiansauer

I don't have time to look into this in detail at the moment, but the challenge here is really a format ambiguity issue. Comma-style and semicolon-style CSVs are ultimately two different formats that are given the same file extension. fread() tries to deal with these, but to dispatch to fread() we need to have an idea of what kind of file it is, and we can't tell from the file extension (which is the only info we have unless you specify format). So making everything seamless is a bit hard.
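
To make the ambiguity concrete, here is a small base R sketch (no rio involved). The data frame and file names are invented for illustration. It writes the same data in both conventions, with both files ending in ".csv", and shows that the separator and the decimal mark have to be chosen together.

```r
# Toy data (illustrative only).
df <- data.frame(id = 1:3, hp = c(100.1, 110, 93))

us_file <- tempfile(fileext = ".csv")
eu_file <- tempfile(fileext = ".csv")
write.csv(df, us_file, row.names = FALSE)   # sep = ",", dec = "."
write.csv2(df, eu_file, row.names = FALSE)  # sep = ";", dec = ","

# Right separator, wrong decimal mark: "100,1" cannot be parsed as a number.
wrong <- read.csv(eu_file, sep = ";")
is.numeric(wrong$hp)  # FALSE

# read.csv2() assumes sep = ";" and dec = "," together.
right <- read.csv2(eu_file)
is.numeric(right$hp)  # TRUE
```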

leeper avatar Jul 16 '18 18:07 leeper

I agree that this is not easy.

Maybe one way would be to write a function guess_delimiter() that reads the first 1000 lines and counts commas and semicolons. Whichever is more frequent is suggested as the delimiter. Thoughts in progress.
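
A rough base R sketch of that counting idea, just to make it concrete. The function name, its arguments, and the tie-breaking rule (prefer the comma) are my own assumptions; nothing like this exists in rio.

```r
# Hypothetical helper, not part of rio: guess the column delimiter by counting
# commas and semicolons in the first `n` lines of a text file.
guess_delimiter <- function(path, n = 1000L) {
  lines <- readLines(path, n = n, warn = FALSE)
  count_char <- function(ch) {
    # gregexpr() finds every literal occurrence; lengths() counts them per line
    sum(lengths(regmatches(lines, gregexpr(ch, lines, fixed = TRUE))))
  }
  # Ties (including "neither found") fall back to the comma.
  if (count_char(";") > count_char(",")) ";" else ","
}

# Usage: pick the matching base R reader for the guessed delimiter.
tmp <- tempfile(fileext = ".csv")
write.csv2(mtcars, tmp, row.names = FALSE)
sep <- guess_delimiter(tmp)
dat <- if (sep == ";") read.csv2(tmp) else read.csv(tmp)
str(dat$hp)
```

The heuristic is fragile: a semicolon-separated file whose columns are all decimals (e.g. a line like 1,5;2,5) contains more commas than semicolons, and commas inside quoted text fields are counted as well.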

sebastiansauer avatar Jul 17 '18 08:07 sebastiansauer

My inclination is not to do that, given that fread() (which is what rio is ultimately using for the import) is all about speed. Feels a bit awkward to read in part of the file only to then read in the rest of the file using fread().

Even readr::read_csv() (which actually does parse the first 1000 rows to detect column types) fails on this use case (and readr::read_csv2() succeeds, just as rio::import(... , format = "csv2"), etc. do).

It's apparently a quite tricky problem. Again, this is because ".csv" is ambiguous and humans might be better at resolving that ambiguity than computers.

leeper avatar Jul 21 '18 13:07 leeper

That's reasonable. Maybe I'll come up with a function such as guess_format(). But maybe it's not worth the fuss...

sebastiansauer avatar Jul 23 '18 08:07 sebastiansauer