DatabaseConnector
DatabaseConnector copied to clipboard
Output to tibble?
R data frames are pretty horrible, with lots of automatic conversions that break your code if you're not careful (e.g. 1-row data frames become vectors, strings become factors, etc.)
Perhaps querySql
should output tibbles instead? Or should we add a querySql.tibble
function?
I prefer option 2 (adding querySql.tibble), since many OHDSI tools are based on querySql. And some old functions and '[' might not work on tibbles.
I think type stability is an important principle for non-interactive R code. Functions in study packages should return a predictable type or fail. [.data.frame
is not type stable. sapply
is another culprit.
tibble:::[.tbl_df
is type stable. I also think there are lots of cool things we could do with list columns (columns of a tibble containing arbitrary R objects).
@ablack3 explanation is spot on. I am in support of standardizing to tibble
I agree tibbles are preferred. But to return to the original question: do we make querySql
return tibbles, or do we add an additional function? Switching querySql
to tibbles will likely break people's code, with no way to gracefully transition from the old to the new code (e.g. we can't use deprecated warnings). On the flip side, I would hate having querySql.tibble
next to querySql
for eternity.
I think the default class should be tibble for all ohdsi R functions that returns dataframe. So, i support changing querySql
to return tibble.
Is there a backward compatibility issue that would happen if we do so?
One issue I can think of is that this returns a vector:
cars[, "speed"]
whereas this returns a tibble:
cars <- as.tibble(cars)
cars[, "speed"]
A vector has 1 dimension, a tibble 2, so downstream code would need to be different.
Here is one example of an issue that might arise.
library(tibble)
targetIds <- sample(1:10, 5)
targetStrataXref <- data.frame(targetId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> [1] 5 7 8 9 10
targetStrataXref <- tibble(targetId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> # A tibble: 5 x 1
#> targetId
#> <int>
#> 1 5
#> 2 7
#> 3 8
#> 4 9
#> 5 10
One benefit of using tibbles is more strict and predictable behavior overall. For example if there is a typo in a column name it will be caught earlier.
library(tibble)
targetIds <- sample(1:10, 5)
targetStrataXref <- data.frame(targtId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
tsXrefSubset
#> integer(0)
targetStrataXref <- tibble(targtId = 1:10)
tsXrefSubset <- targetStrataXref[targetStrataXref$targetId %in% targetIds, ]
#> Warning: Unknown or uninitialised column: `targetId`.
#> Error: Must subset rows with a valid subscript vector.
#> ℹ Logical subscripts must match the size of the indexed input.
#> x Input has size 10 but subscript `targetStrataXref$targetId %in% targetIds` has size 0.
tsXrefSubset
#> integer(0)
I propose we change the output of querySql
, and not add a querySql.tibble
function.
This will be a breaking change, so should only be included in a major release. We just had a major release, so I'll put it under the milestone for the next major release, but that might be a while.
@schuemie would you consider this something that should be standardized across all OHDSI R packages i.e. a HADES convention that outputs should be type stable, dataframe should be tibble?
An example of tibble vs data frame here
https://github.com/OHDSI/DatabaseConnector/blob/12332c0d56c1de53df6ca9009e95364a59124076/R/InsertTable.R#L420
in code above, is data' is a tibble - column stays a tibble; but if
data` is a data frame, column becomes a vector (a change in type). This 'side effect' was used in this code - but when input became tibble it caused an error