DatabaseConnector icon indicating copy to clipboard operation
DatabaseConnector copied to clipboard

Output to tibble?

Open schuemie opened this issue 6 years ago • 10 comments

R data frames are pretty horrible, with lots of automatic conversions that break your code if you're not careful (e.g. 1-row data frames become vectors, strings become factors, etc.)

Perhaps querySql should output tibbles instead? Or should we add a querySql.tibble function?

schuemie avatar Feb 13 '19 10:02 schuemie

I prefer option 2 (adding querySql.tibble), since many OHDSI tools are based on querySql. And some old functions and '[' might not work on tibbles.

chandryou avatar Feb 13 '19 10:02 chandryou

I think type stability is an important principle for non-interactive R code. Functions in study packages should return a predictable type or fail. [.data.frame is not type stable. sapply is another culprit. tibble:::[.tbl_df is type stable. I also think there are lots of cool things we could do with list columns (columns of a tibble containing arbitrary R objects).

ablack3 avatar Aug 29 '20 20:08 ablack3

@ablack3 explanation is spot on. I am in support of standardizing to tibble

gowthamrao avatar Aug 29 '20 20:08 gowthamrao

I agree tibbles are preferred. But to return to the original question: do we make querySql return tibbles, or do we add an additional function? Switching querySql to tibbles will likely break people's code, with no way to gracefully transition from the old to the new code (e.g. we can't use deprecated warnings). On the flip side, I would hate having querySql.tibble next to querySql for eternity.

schuemie avatar Aug 31 '20 05:08 schuemie

I think the default class should be tibble for all ohdsi R functions that returns dataframe. So, i support changing querySql to return tibble. Is there a backward compatibility issue that would happen if we do so?

gowthamrao avatar Aug 31 '20 12:08 gowthamrao

One issue I can think of is that this returns a vector:

cars[, "speed"]

whereas this returns a tibble:

cars <- as.tibble(cars)
cars[, "speed"]

A vector has 1 dimension, a tibble 2, so downstream code would need to be different.

schuemie avatar Aug 31 '20 12:08 schuemie

Here is one example of an issue that might arise.

library(tibble)
targetIds <- sample(1:10, 5)

targetStrataXref <- data.frame(targetId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> [1]  5  7  8  9 10

targetStrataXref <- tibble(targetId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> # A tibble: 5 x 1
#>   targetId
#>      <int>
#> 1        5
#> 2        7
#> 3        8
#> 4        9
#> 5       10

One benefit of using tibbles is more strict and predictable behavior overall. For example if there is a typo in a column name it will be caught earlier.

library(tibble)
targetIds <- sample(1:10, 5)

targetStrataXref <- data.frame(targtId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> integer(0)

targetStrataXref <- tibble(targtId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
#> Warning: Unknown or uninitialised column: `targetId`.
#> Error: Must subset rows with a valid subscript vector.
#> ℹ Logical subscripts must match the size of the indexed input.
#> x Input has size 10 but subscript `targetStrataXref$targetId %in% targetIds` has size 0.
tsXrefSubset
#> integer(0)

ablack3 avatar Aug 31 '20 12:08 ablack3

I propose we change the output of querySql, and not add a querySql.tibble function.

This will be a breaking change, so should only be included in a major release. We just had a major release, so I'll put it under the milestone for the next major release, but that might be a while.

schuemie avatar Sep 01 '20 09:09 schuemie

@schuemie would you consider this something that should be standardized across all OHDSI R packages i.e. a HADES convention that outputs should be type stable, dataframe should be tibble?

gowthamrao avatar Sep 01 '20 09:09 gowthamrao

An example of tibble vs data frame here

https://github.com/OHDSI/DatabaseConnector/blob/12332c0d56c1de53df6ca9009e95364a59124076/R/InsertTable.R#L420

in code above, is data' is a tibble - column stays a tibble; but if data` is a data frame, column becomes a vector (a change in type). This 'side effect' was used in this code - but when input became tibble it caused an error

gowthamrao avatar Sep 07 '20 15:09 gowthamrao