RSQLite icon indicating copy to clipboard operation
RSQLite copied to clipboard

CAST(NULL AS type) does not cast missing data NA to the expected type

Open gavril0 opened this issue 3 years ago • 2 comments

It seems impossible to cast NA when a new variable containing only NA is created within SQL. More precisely, CAST(expr AS type) in a SELECT query always returns logical NA when the expression contains only NA/NULL values.

As a result, it does not appear possible to prevent the NA type for a new variable to be logical if the subset of data involved in the computation has only missing data.

Example 1

library(RSQLite)
library(DBI)
con <- dbConnect(SQLite(), ":memory:")

# create a data frame with numeric and character NA
df <- data.frame(num=as.numeric(NA),text=as.character(NA))

# load in the data base
dbWriteTable(con, "test", value = df, overwrite = TRUE )

# NA type corresponds to column types when recovering 
rs <- dbGetQuery(con, "SELECT * FROM test") )

str(rs) 'data.frame': 1 obs. of 2 variables: $ num : num NA $ text: chr NA

So far, it behaves as expected. See following comment

  • The first non-NULL value decides the type of the column.
  • If all values are NULL, the result is a vector of NAs of the declared data type for the column, or logical.
# compute a new variable from NA value
rs <- str( dbGetQuery(con, "SELECT num, num+1 AS new FROM test") )

str(rs) 'data.frame': 1 obs. of 2 variables: $ num : num NA $ new: logi NA

When creating a new variable, the data type of the new variable is logical, which might be ok in this case (but this makes less sense in the second example)

rs <- dbGetQuery(con, "SELECT CAST(num+1 AS INTEGER) AS new FROM test") 

str(rs) 'data.frame': 1 obs. of 1 variable: $ new: logi NA

CAST(NULL as type) does not work in this context. The NA is still logical:

 # Casting a new column of NA/NULL does not work
rs<- dbGetQuery(con,  "SELECT CAST(NULL AS INTEGER) AS new_num, CAST(NULL AS TEXT) new_text FROM test") 

str(rs) 'data.frame': 1 obs. of 2 variables: $ new_num : logi NA $ new_text: logi NA

Example 2

The above example is contrived but the issue of the return type for NA is particularly annoying when there is a NA is some records but not other:

df2 <- data.frame(id = c(1,2),  num = c(1, NA), text = c("a", NA) )
 dbWriteTable(con, "test2", value = df2, overwrite = TRUE )

# compute some extra variables
rs <- dbGetQuery(con, "SELECT *, num+1 AS new_num,  text||'b' AS new_text FROM test2")

str(rs) 'data.frame': 2 obs. of 5 variables: $ id : num 1 2 $ num : num 1 NA $ text : chr "a" NA $ new_num : num 2 NA $ new_text: chr "ab" NA

Note that the extra variables have the expected types. However, the type is logical when the same operation is executed only on the second record:

# compute some extra variables when there missing data
rs <- dbGetQuery(con, "SELECT *, num+1 AS new_num,  text||'b' AS new_text FROM test2 WHERE id=2")

str(rs) 'data.frame': 1 obs. of 5 variables: $ id : num 2 $ num : num NA $ text : chr NA $ new_num : logi NA $ new_text: logi NA

Note that this is different from R behaviour:

tmp <- df2[df2$id==2,]
tmp$new_num <- tmp$num+1

str(tmp) 'data.frame': 1 obs. of 4 variables: $ id : num 2 $ num : num NA $ text : chr NA $ new_num: num NA

I understand that there is a default toward logical when a new variable contains only NA but I think that it should be possible to obtain a column of the desired type even if it contains only missing data in one way or another.

gavril0 avatar Mar 21 '22 21:03 gavril0

The issue was closed by mistake.

gavril0 avatar Mar 21 '22 22:03 gavril0

Thanks. Simpler reprex:

library(RSQLite)
library(DBI)

con <- dbConnect(SQLite(), ":memory:")

df2 <- data.frame(id = c(1, 2), num = c(1, NA))
dbWriteTable(con, "test2", value = df2, overwrite = TRUE)

rs <- dbGetQuery(con, "SELECT *, num+1 AS new_num FROM test2 WHERE id=2")
str(rs)
#> 'data.frame':    1 obs. of  3 variables:
#>  $ id     : num 2
#>  $ num    : num NA
#>  $ new_num: logi NA

Created on 2022-03-23 by the reprex package (v2.0.1)

Looks like the issue here is that sqlite doesn't advertise to us the type of the num + 1 expression. Data-wise all NULL values look identical (they come back as SQLITE_NULL). There seems little we can do here.

If you're free to choose the database engine, consider DuckDB as a type-stable replacement.

krlmlr avatar Mar 23 '22 01:03 krlmlr