rio
Consider setting "," as default decimal separator when column delimiter is ";"
Please specify whether your issue is about:
- [ ] a possible bug
- [ ] a question about package functionality
- [x] a suggested code or documentation change, improvement to the code, or feature request
I would like to use rio::import() as a wrapper around read_csv2(). The good thing is that import() guesses the delimiter (e.g., ";") correctly. However, it does not recognize the decimal separator ","; the comma as decimal separator is what read_csv2() uses.
Please consider using the comma "," as the standard decimal separator if the delimiter is ";".
Of course it is quick to add the parameter by hand when I use import(); however, I work a lot with students who are being introduced to R. It's a constant hindrance for them to get the decimal and column separators right.
This feature would be a plus over, e.g., read_csv2().
See this example:
library(rio)
library(dplyr)  # provides glimpse()
data("mtcars")
mtcars$hp[1] <- 100.1
write.csv2(mtcars, "mtcars2.csv")  # semicolon-delimited, decimal comma
m2_ohno <- rio::import("mtcars2.csv")
glimpse(m2_ohno)
str(m2_ohno$hp)  # character, not numeric
chr [1:32] "100,1" "110" "93" "110" "175" "105" "245" "62"
## session info for your system
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rio_0.5.10 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] tidyselect_0.2.4 reshape2_1.4.3 haven_1.1.2 lattice_0.20-35 colorspace_1.3-2 yaml_2.1.19 rlang_0.2.1 pillar_1.2.3 foreign_0.8-70 glue_1.2.0.9000 withr_2.1.2 modelr_0.1.2
[13] readxl_1.1.0 bindrcpp_0.2.2 bindr_0.1.1 plyr_1.8.4 munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0 zip_1.0.0 rvest_0.3.2 psych_1.8.4 knitr_1.20 parallel_3.5.0
[25] curl_3.2 broom_0.4.5 Rcpp_0.12.17 clipr_0.4.1 scales_0.5.0 jsonlite_1.5 mnormt_1.5-5 hms_0.4.2 stringi_1.2.3 openxlsx_4.1.0 grid_3.5.0 cli_1.0.0
[37] tools_3.5.0 magrittr_1.5 lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.1 data.table_1.11.4 xml2_1.2.0 datapasta_3.0.0 lubridate_1.7.4 assertthat_0.2.0 httr_1.3.1 rstudioapi_0.7
[49] R6_2.2.2 nlme_3.1-137 compiler_3.5.0
This is already supported; you just need to specify format. The default is "US-style" CSV, but you can import "Euro-style" CSV with either of:
> str(rio::import("mtcars2.csv", format = "csv2")$hp)
num [1:32] 100 110 93 110 175 ...
> str(rio::import("mtcars2.csv", format = ";")$hp)
num [1:32] 100 110 93 110 175 ...
Thanks for the clarification.
Is there a way to rio::import a CSV without knowing in advance what its format is? Like: "Gimme a csv, don't tell me its format, I will import it correctly".
@leeper can check me here, but:
That should be handled under the hood by data.table::fread(), which rio::import() wraps for delimited files. The flat-file methods in this package end up passing sep = "auto" to fread(), regardless of the specific file extension, and fread() will then try to determine the format automatically.
However, it looks like there's a stop() call if the file has no extension, so if you don't know the actual file format (i.e. delimiter), and the file has no extension, just give the file a "pretend" .csv extension, and import() will still determine the best format via fread(). Or, leave the extension off and pass format = "csv", to the same effect.
Hi @ryapric,
it appears that this auto-detection does not work. Consider the following example, where 100.1 (as numeric) is saved in csv2 format. rio::import() does not recognize the decimal separator (",") correctly in this case.
library(rio)
library(dplyr)  # provides glimpse()
data("mtcars")
mtcars$hp[1] <- 100.1
write.csv2(mtcars, "mtcars2.csv")  # semicolon-delimited, decimal comma
m2_ohno <- rio::import("mtcars2.csv")
glimpse(m2_ohno)
str(m2_ohno$hp)  # character, not numeric
chr [1:32] "100,1" "110" "93" "110" "175" "105" "245" "62"
Let me know if I'm missing something. In any case, it's not essential.
I don't have time to look into this in detail at the moment, but the challenge here is really a format ambiguity issue. Comma-style and semicolon-style CSVs are ultimately two different formats that are given the same file extension. fread() tries to deal with these, but to dispatch to fread() we need to have an idea of what kind of file it is, and we can't tell from the file extension (which is the only info we have unless you specify format). So making everything seamless is a bit hard.
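To make the ambiguity concrete, here is a minimal base-R sketch. The helper format_from_extension() is hypothetical and only illustrates extension-based dispatch in general; it is not rio's actual code.

```r
# Hypothetical extension-based dispatch, for illustration only;
# this is not how rio is implemented internally.
format_from_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!nzchar(ext)) stop("no file extension; please specify a format")
  ext
}

comma_file <- tempfile(fileext = ".csv")
semicolon_file <- tempfile(fileext = ".csv")
write.csv(mtcars, comma_file)        # sep = ",", dec = "."
write.csv2(mtcars, semicolon_file)   # sep = ";", dec = ","

# Both files yield the same format string, although they need
# different sep/dec settings to be read correctly.
format_from_extension(comma_file)
format_from_extension(semicolon_file)
```

The extension alone cannot tell the two layouts apart, which is why an explicit format is needed to pick the right one.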
I agree that this is not easy.
Maybe a way could be to write a function guess_deliminator() that reads the first 1000 lines and counts commas and semicolons. Whichever is more frequent is suggested as the delimiter. Thoughts in progress.
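A rough sketch of that idea in base R might look like the following. The function name guess_delimiter() and its n_lines argument are assumptions made for illustration; nothing like this exists in rio. The counting is deliberately naive: it does not skip quoted fields, and a semicolon file with many decimal commas per row could be misclassified.

```r
# Naive delimiter guess: count "," and ";" in the first n_lines lines
# and return whichever occurs more often (ties fall back to ",").
guess_delimiter <- function(path, n_lines = 1000L) {
  lines <- readLines(path, n = n_lines, warn = FALSE)
  count_char <- function(ch) {
    sum(lengths(regmatches(lines, gregexpr(ch, lines, fixed = TRUE))))
  }
  if (count_char(";") > count_char(",")) ";" else ","
}

# Reproduce the file from the example above in a temporary location.
path <- tempfile(fileext = ".csv")
cars <- mtcars
cars$hp[1] <- 100.1
write.csv2(cars, path)

sep <- guess_delimiter(path)
dec <- if (sep == ";") "," else "."   # assume the usual pairing of sep and dec
m2 <- read.csv(path, sep = sep, dec = dec, row.names = 1)
str(m2$hp)
```

The guessed separator could presumably also be passed on as format to rio::import(), as in the format = ";" call shown earlier in the thread.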
My inclination is not to do that, given that fread() (which is what rio is ultimately using for the import) is all about speed. It feels a bit awkward to read in part of the file only to then read in the rest of the file using fread().
Even readr::read_csv() (which actually does parse the first 1000 rows to detect column types) fails on this use case (and readr::read_csv2() succeeds, just as rio::import(..., format = "csv2"), etc. do).
It's apparently a quite tricky problem. Again, this is because ".csv" is ambiguous and humans might be better at resolving that ambiguity than computers.
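The decimal-separator half of the problem can be reproduced in base R alone, without rio or readr. This is a small sketch on a made-up two-column file (the file contents are an assumption for illustration): the delimiter is right in both calls, and only dec differs.

```r
# A tiny semicolon-delimited file with decimal commas.
path <- tempfile(fileext = ".csv")
writeLines(c("id;value", "1;1,5", "2;2,5"), path)

wrong_dec <- read.csv(path, sep = ";")             # dec defaults to "."
right_dec <- read.csv(path, sep = ";", dec = ",")  # same settings as read.csv2()

str(wrong_dec$value)  # not numeric: "1,5" is not a number when dec = "."
str(right_dec$value)  # numeric
```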
That's reasonable. Maybe I'll come up with a function such as guess_format(). But maybe it's not worth the fuss...