
Introduce `default_locale2()`

Open dpprdan opened this issue 3 years ago • 3 comments

I’ve recently run into the "read_csv2() with dot as decimal separator" issue that has been mentioned here several times before. Basically, the problem is that read_csv2() cannot handle CSV files that use a semicolon as the delimiter and a dot as the decimal separator.

library(readr)
read_csv2(I("a;b\n1.0;2.0"), locale = locale(decimal_mark = "."))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> num (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1    10    20

The reply so far has been (paraphrasing) “This is not what read_csv2() is intended for, because you don’t have a comma as decimal separator. Use read_delim(delim = ";") instead.”
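For reference, that workaround looks roughly like this for the data above. This is a minimal sketch; `show_col_types = FALSE` is only there to quiet the column messages.

```r
library(readr)

# Same data as above, but with the delimiter set explicitly via read_delim().
# The default locale already uses "." as the decimal mark, so no locale
# argument is needed here.
df <- read_delim(I("a;b\n1.0;2.0"), delim = ";", show_col_types = FALSE)
df
```

With this call the values should come back as 1 and 2 rather than 10 and 20.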

I think this reasoning and its current implementation are problematic.

First, overriding the locale() setting is quite smelly (sorry, Hadley!)

https://github.com/tidyverse/readr/blob/41447cec0d78337c316e5dd109a306101ce6e2ce/R/read_delim.R#L333-L337

As a result, read_csv2() does not apply custom decimal_mark and grouping_mark locale settings. Not only is the dot ignored as a decimal mark, as shown above, but a custom grouping_mark is ignored as well:

read_csv2(I("a;b\n1,8;2'345"), locale = locale(grouping_mark = "'"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> chr (1): b
#> dbl (1): a
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a b    
#>   <dbl> <chr>
#> 1   1.8 2'345

Compare this with read_csv():

read_csv(I("a,b\n1.8,2'345"), locale = locale(grouping_mark = "'"))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): a
#> num (1): b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.8  2345

In addition, the warning message is confusing (e.g. it is not clear that this is indeed an override) and easily overlooked among the other messages. (Can you spot it immediately in the output above? It’s not obvious IMO, even with this minimal column spec.)

Even worse, this warning is always shown by default. See the read_csv2() example:

read_csv2(I("a;b\n1,0;2,0"))
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

This is because default_locale() implies decimal_mark = ".", which read_csv2() then has to override and throw the warning. But this essentially means that the locale = default_locale() default in read_csv2() is useless, because to get rid of the warning, you have to define a different locale() anyway.

read_csv2(I("a;b\n1,0;2,0"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ";"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
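The default that triggers the override can be inspected directly. A small sketch, assuming the `readr.default_locale` option has not been changed in the session:

```r
library(readr)

# default_locale() returns the locale readr falls back to when none is given.
loc <- default_locale()

loc$decimal_mark   # "." unless the default locale was customised
loc$grouping_mark  # "," unless the default locale was customised
```

So the default that read_csv2() advertises in its signature is never the one that is actually used for number parsing.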

Conceptually, it would be more sound to treat read_csv2() the same as read_csv() and read_tsv():

“read_csv() and read_tsv() are special cases of the more general read_delim(). They’re useful for reading the most common types of flat file data, comma separated values and tab separated values, respectively.”

I.e. set the delimiter, but grant flexibility on the decimal separator.

E.g. with read_tsv() you can easily do the following

read_tsv(I("a\tb\n1,1\t2,2"), locale = locale(decimal_mark = ","))
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>       a     b
#>   <dbl> <dbl>
#> 1   1.1   2.2

Of course, read_csv2() would need a different default locale (i.e. a different default decimal separator), but it needs that anyway, because the current default is useless. This could be either a default_locale2() or ~~locale(decimal_mark = ",")~~ locale(decimal_mark = ",", grouping_mark = ".").
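A rough sketch of what I have in mind, not an actual implementation: `default_locale2()` does not exist in readr, and `read_csv2_sketch()` is a hypothetical stand-in that simply delegates to read_delim().

```r
library(readr)

# Proposed default: comma as decimal mark, dot as grouping mark.
# `default_locale2()` is NOT part of readr; it is only the proposal.
default_locale2 <- function() {
  locale(decimal_mark = ",", grouping_mark = ".")
}

# Hypothetical stand-in for read_csv2(): it fixes the delimiter only and
# passes the locale through without overriding it.
read_csv2_sketch <- function(file, locale = default_locale2(), ...) {
  read_delim(file, delim = ";", locale = locale, ...)
}

# Comma as decimal mark works out of the box, with no override message.
comma <- read_csv2_sketch(I("a;b\n1,5;2,5"), show_col_types = FALSE)

# Dot as decimal mark works when the user asks for it.
dot <- read_csv2_sketch(
  I("a;b\n1.5;2.5"),
  locale = locale(decimal_mark = "."),
  show_col_types = FALSE
)
```

Both calls should yield 1.5 and 2.5; the only thing the wrapper fixes is the delimiter.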

Is that something you would consider? Happy to draft a PR.

One last thing:

“This format is common in some European countries.” (https://readr.tidyverse.org/reference/read_delim.html)

It’s not just “some European countries”. Half the world uses the comma as a decimal separator, and while I don’t have any numbers on the semicolon as delimiter in CSVs, it is clear that the delimiter cannot be the comma in those countries. Maybe it’s just me, but frankly that line also sounds a tad dismissive to me. I know it’s not meant that way, but still.

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-10-27
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.2.0)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
#>  crayon        1.5.2   2022-09-29 [1] RSPM
#>  digest        0.6.30  2022-10-18 [1] CRAN (R 4.2.1)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate      0.17    2022-10-07 [1] RSPM
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms           1.1.2   2022-08-19 [1] CRAN (R 4.2.1)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3   2022-10-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         0.3.5   2022-10-06 [1] RSPM
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.3   2022-10-01 [1] CRAN (R 4.2.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.17    2022-10-07 [1] RSPM
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  styler        1.8.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect    1.2.0   2022-10-10 [1] RSPM
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  vroom         1.6.0   2022-09-30 [1] RSPM
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.34    2022-10-18 [1] CRAN (R 4.2.1)
#>  yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/Daniel/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

dpprdan • Oct 27 '22 16:10

Edit: This was/is a separate, albeit related, issue #1468.

dpprdan • Oct 27 '22 18:10

@phish108 has brought to my attention that institutions in Switzerland, like the Federal Statistical Office, publish at least some CSV files with semicolon as delimiter and dot as decimal sign.

Here is one example (metadata)

library(readr)
ch_url <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/21765619/master"
read_lines(ch_url, n_max = 5)
#> [1] "\"TIME_PERIOD\";\"GEO\";\"POP\";\"ERWP\";\"ERWL\";\"UNIT_MEA\";\"OBS_VALUE\";\"OBS_CONFIDENCE\";\"OBS_STATUS\""
#> [2] "\"2010/2012\";\"101\";\"Total\";\"0\";\"Total\";\"pers\";12374.03;5.620;\"A\""                                 
#> [3] "\"2010/2012\";\"101\";\"Total\";\"1\";\"Total\";\"pers\";28394.11;3.718;\"A\""                                 
#> [4] "\"2010/2012\";\"101\";\"Total\";\"Total\";\"Total\";\"pers\";40768.14;3.074;\"A\""                             
#> [5] "\"2010/2012\";\"102\";\"Total\";\"0\";\"Total\";\"pers\";7288.64;7.321;\"A\""

another example (metadata)

ch_url2 <- "https://www.web.statistik.zh.ch/ogd/data/KANTON_ZUERICH_375.csv"
read_lines(ch_url2, n_max = 5)
#> [1] "BFS_NR;GEBIET_NAME;THEMA_NAME;SET_NAME;SUBSET_NAME;INDIKATOR_ID;INDIKATOR_NAME;INDIKATOR_JAHR;INDIKATOR_VALUE;EINHEIT_KURZ;EINHEIT_LANG;"            
#> [2] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1990;2.6;Mio.Fr.;Millionen Franken;"
#> [3] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1991;3.0;Mio.Fr.;Millionen Franken;"
#> [4] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1992;3.3;Mio.Fr.;Millionen Franken;"
#> [5] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1993;3.6;Mio.Fr.;Millionen Franken;"

Apparently this spec is not used consistently. Here is an example from the Statistical Office where they use comma as delimiter (metadata)

ch_url3 <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/24106318/master"
read_lines(ch_url3, n_max = 5)
#> [1] "\"REGION\",\"CANTON\",\"PERIOD\",\"VALUE\",\"STATUS\",\"OBS_COEF\""
#> [2] "\"Total\",\"Total\",\"2023-01\",\"2.19208880770041\",\"A\",\"A\""  
#> [3] "\"1\",\"Total\",\"2023-01\",\"3.47967832561423\",\"A\",\"A\""      
#> [4] "\"1\",\"22\",\"2023-01\",\"3.49266747468504\",\"A\",\"A\""         
#> [5] "\"1\",\"23\",\"2023-01\",\"2.97652113777818\",\"A\",\"A\""

But be that as it may, why not support semicolon as delimiter and dot as decimal sign with read_csv2(), when even official institutions publish files like that?

dpprdan • Feb 26 '23 21:02

Should fix at same time as #1468

hadley • Aug 01 '23 12:08