
Problems with IDate data type after reading a CSV with import

Open jllipatz opened this issue 2 years ago • 1 comments

Hello,

I am experiencing non-deterministic problems with a data frame containing an IDate column. In the following example I kept only two columns of the original file (donnees-hospitalieres-covid19-2021-12-08-19h05.csv, or the equivalent file for a different date, from https://www.data.gouv.fr/fr/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/). The file was read using import without any additional options. Replacing the dplyr call with df2 <- df[df$sexe==0,] makes things work again. So I don't really know where the problem lies: in rio, in dplyr, or in the IDate type from data.table.

If the problem is related to the IDate type, wouldn't it be possible to use the standard Date type instead, overriding what fread did?
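To make the suggestion concrete, here is a minimal sketch of what such a conversion could look like on the user side. It is not something rio does today; the helper name idate_to_date is made up for illustration, and the IDate column is built by hand so the snippet runs with base R only:

```r
# Hypothetical helper (not part of rio or data.table): convert every column
# that carries the IDate class into a plain base Date column.
idate_to_date <- function(df) {
  is_idate <- vapply(df, inherits, logical(1), what = "IDate")
  for (nm in names(df)[is_idate]) {
    # Drop the IDate class and keep the day counts as doubles, like base Date.
    df[[nm]] <- structure(as.numeric(unclass(df[[nm]])), class = "Date")
  }
  df
}

# Stand-in for an imported file: 'jour' is given the IDate class by hand
# (assumption: this mimics what fread returns for a date column).
df <- data.frame(sexe = c(0L, 1L, 2L))
df$jour <- structure(rep(18339L, 3), class = c("IDate", "Date"))

df <- idate_to_date(df)
class(df$jour)
storage.mode(df$jour)
```

After the conversion the column is an ordinary Date, so the IDate-specific arithmetic methods are no longer involved.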

From a fresh session under R 4.1.1:

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame':   191540 obs. of  2 variables:
 $ sexe: int  0 1 2 0 1 2 0 1 2 0 ...
 $ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365

> library(rio)
The following rio suggested packages are not installed: ‘arrow’, ‘hexView’, ‘jsonlite’, ‘pzfx’, ‘readODS’, ‘rmarkdown’, ‘rmatio’
Use 'install_formats()' to install them
> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame':   191540 obs. of  2 variables:
 $ sexe: int  0 1 2 0 1 2 0 1 2 0 ...
 $ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365
Error in `-.IDate`(df2$jour, 365) : 
  Internal error: storage mode of IDate is somehow no longer integer

jllipatz avatar Dec 09 '21 11:12 jllipatz

I can't reproduce this issue:

library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
df$jour
sapply(df,class)
export(df, format = 'RDS')
import('df.rds')
df <- import('df.rds')
df$jour

All works fine for me.

R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 12.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rio_0.5.29

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3      fansi_1.0.3       utf8_1.2.2        crayon_1.5.1
 [5] cellranger_1.1.0  lifecycle_1.0.1   magrittr_2.0.2    zip_2.2.0
 [9] pillar_1.7.0      stringi_1.7.6     rlang_1.0.2       cli_3.2.0
[13] readxl_1.3.1      curl_4.3.2        data.table_1.14.2 vctrs_0.3.8
[17] ellipsis_0.3.2    openxlsx_4.2.5    tools_4.1.0       forcats_0.5.1
[21] foreign_0.8-81    glue_1.6.2        hms_1.1.1         compiler_4.1.0
[25] pkgconfig_2.0.3   haven_2.4.1       tibble_3.1.6

jsonbecker avatar May 04 '22 21:05 jsonbecker

The problem is reproducible, but I think it is a dplyr problem. Something similar in data.table was discussed here: https://github.com/Rdatatable/data.table/issues/2008

library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
export(df, "df.RDS")
df <- import('df.RDS')
df$b <- df$jour-365
df2 <- dplyr::filter(df,sexe==0)
df2$b <- df2$jour-365
#> Error in `-.IDate`(df2$jour, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(df$jour)
#> [1] "integer"
storage.mode(df2$jour)
#> [1] "double"

Created on 2023-09-11 with reprex v2.0.2
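The storage.mode() calls above show the symptom: the column keeps its IDate class but is stored as double. A small sketch for spotting such columns in any data frame follows. The helper name broken_idate_cols is made up for illustration, and the columns are built by hand (an assumption that mimics the state before and after filter()) so the snippet runs with base R only:

```r
# Hypothetical helper: names of the columns that carry the IDate class
# but are no longer stored as integers.
broken_idate_cols <- function(df) {
  is_broken <- vapply(
    df,
    function(x) inherits(x, "IDate") && typeof(x) != "integer",
    logical(1)
  )
  names(df)[is_broken]
}

# Hand-built stand-ins: 'ok' mimics a healthy IDate column,
# 'bad' mimics the same column after its storage mode became double.
df2 <- data.frame(sexe = c(0L, 0L))
df2$ok  <- structure(c(18339L, 18340L), class = c("IDate", "Date"))
df2$bad <- structure(c(18339, 18340), class = c("IDate", "Date"))

broken <- broken_idate_cols(df2)
broken
```

Running a check like this right after a dplyr step would at least turn the non-obvious arithmetic error into an explicit diagnosis.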

Here is a minimal example that is independent of rio:

library(data.table)
library(dplyr)

df <- data.table(a=Sys.Date(),b=14)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "double"

tb <- as_tibble(df)
tb$a-365
#> [1] "2022-09-11"
storage.mode(tb$a)
#> [1] "double"

df$a <- as.IDate(df$a)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "integer"


tb <- as_tibble(df) |> dplyr::filter(a>=Sys.Date())
tb$a-365
#> Error in `-.IDate`(tb$a, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(tb$a)
#> [1] "double"

Created on 2023-09-11 with reprex v2.0.2
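As a possible stopgap, the integer storage can be restored by hand after the filter step. This is only a sketch: the helper name restore_idate_storage is made up, the test column is built by hand so the snippet runs with base R only, and I am inferring from the error message that `-.IDate` only complains about the storage mode, so please check it against your own data.table version:

```r
# Hypothetical helper: put IDate columns back into integer storage.
# Assumes the values are whole days, which is what a date column should hold;
# any fractional part would be truncated by the conversion.
restore_idate_storage <- function(df) {
  for (nm in names(df)) {
    x <- df[[nm]]
    if (inherits(x, "IDate") && typeof(x) != "integer") {
      storage.mode(x) <- "integer"  # changes the storage type, keeps the class
      df[[nm]] <- x
    }
  }
  df
}

# Hand-built stand-in for the filtered tibble: IDate class, double storage.
tb <- data.frame(b = 14)
tb$a <- structure(c(19611, 19612), class = c("IDate", "Date"))

tb <- restore_idate_storage(tb)
storage.mode(tb$a)
class(tb$a)
```

The class attribute is untouched, so the column still prints and behaves as an IDate, only with the storage type that data.table expects.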

@chainsawriot I don't think we need to do anything in rio, but is this something to escalate to the dplyr team?

Edit: see https://github.com/tidyverse/dplyr/issues/6687 and https://github.com/tidyverse/dplyr/issues/6230

schochastics avatar Sep 11 '23 07:09 schochastics

OK, this appears to be an open issue in vctrs (https://github.com/r-lib/vctrs/issues/1781), so I will close this one.

schochastics avatar Sep 11 '23 08:09 schochastics

@schochastics Thank you very much for the investigation.

I also tried dtplyr, with the same result.

chainsawriot avatar Sep 11 '23 08:09 chainsawriot

Oh fascinating!

jsonbecker avatar Sep 11 '23 14:09 jsonbecker