rio
Problems with IDate data type after reading a CSV with import
Hello,
I am experiencing non-deterministic problems with a data frame containing an IDate column. In the following example I kept only two columns of the original file (donnees-hospitalieres-covid19-2021-12-08-19h05.csv, or the equivalent file for a different date, from https://www.data.gouv.fr/fr/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/). The file was read using import without any additional options.
Replacing the dplyr filter() call in the session below with df2 <- df[df$sexe==0,] makes things work again, so I don't really know where the problem lies: in rio, in dplyr, or in the IDate type from data.table.
If the problem is related to the IDate type, wouldn't it be possible to use the standard Date type instead, overriding what fread did?
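To make the question concrete, here is a minimal sketch of converting IDate columns to base Date after reading. It assumes data.table is installed; idate_to_date is a hypothetical helper written for illustration, not a function from rio or data.table, and the small data frame stands in for the imported file.

```r
library(data.table)

# Hypothetical helper: convert every IDate column of a data frame to base Date.
idate_to_date <- function(df) {
  is_idate <- vapply(df, inherits, logical(1), what = "IDate")
  for (nm in names(df)[is_idate]) {
    # as.Date() on an IDate returns a plain Date (stored as double)
    df[[nm]] <- as.Date(df[[nm]])
  }
  df
}

# Stand-in for the imported file: one integer column and one IDate column.
df <- data.frame(sexe = c(0L, 1L, 2L))
df$jour <- as.IDate(c("2020-03-18", "2020-03-18", "2020-03-19"))

df <- idate_to_date(df)
class(df$jour)       # "Date"
df$b <- df$jour - 365
```

With plain Date columns, the subtraction no longer goes through the IDate method, so the integer storage check is not involved.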
From a fresh session under R 4.1.1:
> library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame': 191540 obs. of 2 variables:
$ sexe: int 0 1 2 0 1 2 0 1 2 0 ...
$ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365
> library(rio)
The following rio suggested packages are not installed: ‘arrow’, ‘hexView’, ‘jsonlite’, ‘pzfx’, ‘readODS’, ‘rmarkdown’, ‘rmatio’
Use 'install_formats()' to install them
> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame': 191540 obs. of 2 variables:
$ sexe: int 0 1 2 0 1 2 0 1 2 0 ...
$ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365
Error in `-.IDate`(df2$jour, 365) :
Internal error: storage mode of IDate is somehow no longer integer
I can't reproduce this issue:
library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
df$jour
sapply(df,class)
export(df, format = 'RDS')
import('df.rds')
df <- import('df.rds')
df$jour
All works fine for me.
R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 12.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rio_0.5.29
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 fansi_1.0.3 utf8_1.2.2 crayon_1.5.1
[5] cellranger_1.1.0 lifecycle_1.0.1 magrittr_2.0.2 zip_2.2.0
[9] pillar_1.7.0 stringi_1.7.6 rlang_1.0.2 cli_3.2.0
[13] readxl_1.3.1 curl_4.3.2 data.table_1.14.2 vctrs_0.3.8
[17] ellipsis_0.3.2 openxlsx_4.2.5 tools_4.1.0 forcats_0.5.1
[21] foreign_0.8-81 glue_1.6.2 hms_1.1.1 compiler_4.1.0
[25] pkgconfig_2.0.3 haven_2.4.1 tibble_3.1.6
The problem is reproducible, but I think it is a dplyr problem rather than a rio one. Something similar in data.table was discussed here: https://github.com/Rdatatable/data.table/issues/2008
library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
export(df, "df.RDS")
df <- import('df.RDS')
df$b <- df$jour-365
df2 <- dplyr::filter(df,sexe==0)
df2$b <- df2$jour-365
#> Error in `-.IDate`(df2$jour, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(df$jour)
#> [1] "integer"
storage.mode(df2$jour)
#> [1] "double"
Created on 2023-09-11 with reprex v2.0.2
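As a possible stopgap (not something proposed in this thread), the storage mode could be coerced back to integer in place, since `storage.mode<-` keeps the class attribute. The sketch below simulates the broken column by hand instead of producing it with dplyr, on the assumption that whether filter() triggers it depends on the installed package versions.

```r
library(data.table)

# Simulate the broken state by hand: class IDate, but double storage.
jour <- structure(as.numeric(as.Date("2020-03-18")), class = c("IDate", "Date"))
storage.mode(jour)   # "double"

# Coerce the underlying storage back to integer; the class attribute is kept.
storage.mode(jour) <- "integer"
storage.mode(jour)   # "integer"

# IDate arithmetic passes the storage mode check again.
b <- jour - 365L
```

Applied to the reprex above, this would amount to storage.mode(df2$jour) <- "integer" before computing df2$b.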
Here is a minimal example that is independent of rio:
library(data.table)
library(dplyr)
df <- data.table(a=Sys.Date(),b=14)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "double"
tb <- as_tibble(df)
tb$a-365
#> [1] "2022-09-11"
storage.mode(tb$a)
#> [1] "double"
df$a <- as.IDate(df$a)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "integer"
tb <- as_tibble(df) |> dplyr::filter(a>=Sys.Date())
tb$a-365
#> Error in `-.IDate`(tb$a, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(tb$a)
#> [1] "double"
Created on 2023-09-11 with reprex v2.0.2
@chainsawriot I don't think we need to do anything in rio, but is this something to escalate to the dplyr team?
Edit: see https://github.com/tidyverse/dplyr/issues/6687 and https://github.com/tidyverse/dplyr/issues/6230
OK, this appears to be an open issue in vctrs (https://github.com/r-lib/vctrs/issues/1781), so I will close this here.
@schochastics Thank you very much for the investigation.
I tried dtplyr as well, with the same result.
Oh fascinating!