
Request for duckdb_read_csv2 function and clarification on data types in duckdb_read_csv

Open VincentGuyader opened this issue 1 year ago • 12 comments

Hello everyone,

I have encountered an issue while trying to import a CSV file using DuckDB in R. I am using the duckdb_read_csv function to read the CSV file into a DuckDB database. However, I noticed that the function does not accurately detect the data types of the columns.

library(DBI)
library(duckdb)
library(readr)

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)
readr::write_csv2(iris, "iris.csv")
duckdb::duckdb_read_csv(con,
                        name = "iris_data1",
                        files = 'iris.csv',
                        sep = ";",
                        dec = "."
)
iris_db1 <- dbReadTable(con, "iris_data1")
is.numeric(iris_db1$Sepal.Length)



The issue arises when I try to check whether a column is numeric using is.numeric. Even though the column 'Sepal.Length' should be numeric, the result of is.numeric(iris_db1$Sepal.Length) is FALSE.

I believe one potential solution would be a new duckdb_read_csv2 function that lets users specify column data types explicitly, or improved data type detection in duckdb_read_csv.

Therefore, I would like to request either the addition of a duckdb_read_csv2 function with enhanced data type detection, or an improvement of the existing duckdb_read_csv function. It also seems that the delim, dec, and sep parameters are not being recognized by duckdb_read_csv.

Regards

VincentGuyader avatar Mar 22 '24 15:03 VincentGuyader

Thanks. This is what I see:

library(DBI)
library(duckdb)
library(readr)

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)
readr::write_csv2(iris, "iris.csv")
duckdb::duckdb_read_csv(con, name = "iris_data1", files = "iris.csv")
#> Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
iris_db1 <- dbReadTable(con, "iris_data1")
#> Error: rapi_prepare: Failed to prepare query SELECT * FROM iris_data1
#> Error: Catalog Error: Table with name iris_data1 does not exist!
#> Did you mean "temp.information_schema.schemata"?
#> LINE 1: SELECT * FROM iris_data1
#>                       ^
iris_db1
#> Error in eval(expr, envir, enclos): object 'iris_db1' not found

Created on 2024-03-23 with reprex v2.1.0

Can you please fix the example?

krlmlr avatar Mar 23 '24 06:03 krlmlr

Thanks, I have edited the reprex.

VincentGuyader avatar Mar 23 '24 12:03 VincentGuyader

Thanks, I see it now.

The decimal separator seems to be fixed to "." in DuckDB's COPY statement, see https://duckdb.org/docs/sql/statements/copy.html. We'll need support for that in DuckDB core before being able to do anything here.

Your best bet may be to use more mundane methods to convert the CSV data to Parquet, and to use that from DuckDB.
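Something along these lines, for example (untested sketch; assumes the arrow package is available):

df <- readr::read_csv2("iris.csv")        # readr handles the ";" delimiter and "," decimals
arrow::write_parquet(df, "iris.parquet")
dbExecute(con, "CREATE TABLE iris_data1 AS SELECT * FROM read_parquet('iris.parquet')")
is.numeric(dbReadTable(con, "iris_data1")$Sepal.Length)  # should now be TRUE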

krlmlr avatar Mar 24 '24 16:03 krlmlr

I now see in https://duckdb.org/docs/data/csv/overview that decimal_separator is a supported setting. Would you like to contribute a PR?
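In the meantime, calling read_csv() directly from SQL might already do the trick (untested sketch, assuming a duckdb version that supports decimal_separator):

dbExecute(con, "
  CREATE TABLE iris_data1 AS
  SELECT * FROM read_csv('iris.csv', delim = ';', decimal_separator = ',')
")
iris_db1 <- dbReadTable(con, "iris_data1")
is.numeric(iris_db1$Sepal.Length)  # should be TRUE once the comma decimals are parsed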

krlmlr avatar Apr 24 '24 07:04 krlmlr

Hi, a col_types-style parameter is something I'd really like to see from this package. I'd be happy to provide a PR if @VincentGuyader doesn't have time. Is there a reason why the current iteration uses CREATE TABLE followed by COPY instead of just CREATE TABLE x AS FROM read_csv()? That second method is the one I would have used had I not read the existing code.

eli-daniels avatar Aug 08 '24 12:08 eli-daniels

Thanks, Eli. Often it's wise to separate table creation from population; think of partitioned CSV files, for example.
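For illustration, the split looks roughly like this (table and file names hypothetical):

dbExecute(con, "CREATE TABLE sales (id INTEGER, amount DOUBLE)")
for (f in c("sales_part1.csv", "sales_part2.csv")) {
  dbExecute(con, paste0("COPY sales FROM ", dbQuoteString(con, f), " (HEADER)"))
}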

Going via dbplyr would be another option, using tbl_function() and compute(temporary = FALSE).
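Roughly like this (untested sketch; tbl_function() as exported by duckdb, argument names from memory):

library(dplyr)
csv_tbl <- duckdb::tbl_function(
  con, "read_csv('iris.csv', delim = ';', decimal_separator = ',')"
)
csv_tbl |> compute(name = "iris_data1", temporary = FALSE)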

krlmlr avatar Aug 16 '24 20:08 krlmlr

Would adding dbplyr as a dependency be a good idea though? It seems like everything is based on DBI for the moment. I'm planning to make better use of duckdb's sniff_csv (instead of read.csv) and of duckdb's read_csv function to insert rows into a table created with dbCreateTable, as suggested in https://github.com/duckdb/duckdb-r/issues/142.
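Roughly what I have in mind (untested; the output field names are taken from the sniff_csv docs):

con <- dbConnect(duckdb::duckdb())
write.csv(iris, "iris.csv", row.names = FALSE)

sniff <- dbGetQuery(con, "SELECT * FROM sniff_csv('iris.csv')")
sniff$Delimiter   # detected delimiter
sniff$Columns     # detected column names and types, could feed dbCreateTable()
sniff$Prompt      # a complete read_csv('iris.csv', ...) call with the detected options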

Does that sound reasonable @krlmlr ?

eli-daniels avatar Sep 11 '24 12:09 eli-daniels

Thanks for the PR, Eli. I'd rather not depend on dbplyr here.

Since the implementation of this function, duckdb's features for CSV ingestion have advanced substantially, see https://duckdb.org/docs/data/csv/overview:

  • Sniffing
  • CREATE TABLE ... AS SELECT * FROM read_csv(...)

I wonder what a good ingestion API that separates the sniffing, table creation, and table import steps looks like. Perhaps the duckdbfs package is a better fit? CC @cboettig.
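For reference, duckdbfs already goes in that direction (from memory, untested; argument names may differ):

library(duckdbfs)
lazy_iris <- open_dataset("iris.csv", format = "csv")  # lazy dplyr tbl backed by duckdb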

krlmlr avatar Oct 15 '24 06:10 krlmlr

Hi, yeah, I think the CSV sniffer and reader in DuckDB are good enough to stand on their own two feet. The error handling is already great, so error handling in the API should be minimal.

I've sketched this out below; it works but would probably need more tests if turned into a PR. The function takes the dots arguments, passes them on to DuckDB's read_csv(), and builds a CREATE TABLE ... AS SELECT * FROM read_csv(...) statement. It relies on users knowing the DuckDB docs a bit, e.g. the available data types and read_csv arguments.

I think this would be the easiest version to maintain in the long-run as it should adapt pretty easily to changes to DuckDB down the road.

Let me know what you think.

# Note: I had to install the latest version of DuckDB from r-universe, otherwise the multi-file import crashed my R session.
# install.packages("duckdb", repos = c("https://duckdb.r-universe.dev", "https://cloud.r-project.org"))

library(duckdb)
library(tidyverse)
library(testthat)

duckdb_read_csv2 <- function(conn, name, files,
                             temporary = FALSE, overwrite = FALSE, append = FALSE,
                             show_sql = FALSE, ...) {

  if (overwrite && append) stop("Both `overwrite` and `append` are TRUE")
  if (temporary && append) stop("Both `temporary` and `append` are TRUE")

  table_settings <- if (overwrite && temporary) {
    "CREATE OR REPLACE TEMP TABLE"
  } else if (overwrite) {
    "CREATE OR REPLACE TABLE"
  } else if (temporary) {
    "CREATE TEMP TABLE"
  } else if (append) {
    "INSERT INTO"
  } else {
    "CREATE TABLE"
  }

  # name <- dbQuoteIdentifier(conn, name) # worth adding for table names that need quoting

  # Quote each file path and build a DuckDB list literal: ['a.csv', 'b.csv']
  files <- normalizePath(files, winslash = "/")
  files <- paste0("[", paste(dbQuoteString(conn, files), collapse = ",\n"), "]")

  # Format the dots as named read_csv() arguments
  dots <- list(...)
  if (length(dots) > 0) {
    csv_args <- vapply(dots, format_dots, character(1), conn = conn)
    csv_args_str <- paste(names(csv_args), "=", csv_args, collapse = ",\n")
    read_csv_str <- paste0("read_csv(", files, ", ", csv_args_str, ")")
  } else {
    read_csv_str <- paste0("read_csv(", files, ")")
  }

  # INSERT INTO takes the SELECT directly; the CREATE variants need AS
  select_clause <- paste("SELECT *\n  FROM", read_csv_str)
  sql_query <- if (append) {
    DBI::SQL(paste(table_settings, name, select_clause, ";"))
  } else {
    DBI::SQL(paste(table_settings, name, "AS", select_clause, ";"))
  }

  if (show_sql) {
    message(sql_query)
  }
  dbExecute(conn, sql_query)

  # invisible(dplyr::tbl(conn, name)) # This is my preferred output personally but requires dplyr
  out <- dbGetQuery(conn, paste("SELECT COUNT(*) FROM", name))[[1]]
  invisible(out)
}

# Formats dots passed to duckdb_read_csv2() based on the type of data.
format_dots <- function(x, conn) {

  if ((is.list(x) || is.vector(x)) && !is.null(names(x))) {
    # named list/vector -> DuckDB struct, e.g. columns = {'a': 'DOUBLE', ...}
    y <- paste0(dbQuoteString(conn, names(x)), ": ", dbQuoteString(conn, toupper(unlist(x))))
    paste0("{\n    ", paste(y, collapse = ",\n    "), "\n}")

  } else if (is.list(x)) {
    # unnamed list -> DuckDB list, e.g. types = ['DOUBLE', 'VARCHAR']
    paste0("[", paste(dbQuoteString(conn, unlist(x)), collapse = ",\n"), "]")

  } else if (is.character(x)) {
    # plain strings need single quotes, e.g. dateformat = '%Y-%m-%d'
    as.character(dbQuoteString(conn, x))

  } else if (is.logical(x)) {
    # DuckDB expects lower-case true/false
    tolower(as.character(x))

  } else {
    # numbers etc. can go through as-is
    as.character(x)
  }
}

# Tests 
paths <- paste0('iris', 1:10, '.csv')
dates <- as.Date(seq(1,150), origin = "2011-01-01")
iris_dated <- iris
iris_dated$date <- dates

map(paths, function(x) write.csv(iris_dated, x, row.names = FALSE))

conn <- dbConnect(duckdb())

# columns 
duckdb_read_csv2(conn, 'iris', paths,
                 overwrite = TRUE,
                 filename = TRUE,
                 show_sql = TRUE,
                 columns = list(
                   Sepal.Length = 'DOUBLE',
                   Sepal.Width = 'DOUBLE',
                   P.L = 'DOUBLE',
                   P.W = 'DOUBLE',
                   Spec = 'VARCHAR',
                   date = 'DATE'
                 ))

res <- dbReadTable(conn, 'iris')

expect_identical(colnames(res), c('Sepal.Length', 'Sepal.Width', 'P.L',
                                  'P.W', 'Spec', 'date', 'filename'))

expect_equal(nrow(res), nrow(iris)*10)
expect_equal(res$date, rep(dates, 10))

# types 
duckdb_read_csv2(conn, 'iris', paths,
                 overwrite = TRUE,
                 filename = TRUE,
                 show_sql = TRUE,
                 types = list('DOUBLE','DOUBLE','DOUBLE',
                              'DOUBLE','VARCHAR','DATE'
                 ))

res <- dbReadTable(conn, 'iris')

expect_equal(colnames(res), c(colnames(iris), 'date', 'filename'))
expect_true(is.numeric(res$Sepal.Length))
expect_true(inherits(res$date[[1]], "Date"))
expect_true(is.character(res$Species))


eli-daniels avatar Oct 15 '24 16:10 eli-daniels

That's a good start, thanks. I'd like to be able to use all ingestion features also from duckplyr: open a lazy data frame from CSV/Parquet/Excel/etc. Conversely, users might want to create a lazy dbplyr table from a CSV file. So there's the following:

  • Data formats:
    • [ ] CSV: https://duckdb.org/docs/data/csv/overview
    • [ ] Parquet: https://duckdb.org/docs/data/parquet/overview
    • [ ] Excel: https://duckdb.org/docs/guides/file_formats/excel_import.html
    • [ ] JSON: https://duckdb.org/docs/data/json/overview
    • [ ] any table-valued duckdb function, really
  • Targets
    • [ ] new/existing duckdb table
    • [ ] SQL query
    • [ ] duckdb relational object
    • [ ] output file

I'm thinking about a "spec" object that could be rendered into either an SQL query or a relational object. We'd need spec constructors for each data format, and two renderers -- one for SQL and one for relational. The constructor would do the type checking. Helper functions would then compose spec constructors and renderers as needed. Perhaps we can scrape the documentation to auto-generate spec constructors?
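To make that concrete, a rough sketch of the shape I have in mind (all names hypothetical, assuming an open connection conn):

# constructor: one per ingestion function, does the argument checking
new_csv_spec <- function(files, delim = ",") {
  stopifnot(is.character(files), is.character(delim))
  structure(list(fun = "read_csv", files = files, args = list(delim = delim)),
            class = "duckdb_ingest_spec")
}

# one renderer targets SQL; a second one could target the relational API
render_sql <- function(spec, conn) {
  files <- paste0("[", paste(DBI::dbQuoteString(conn, spec$files), collapse = ", "), "]")
  args <- paste(names(spec$args), "=", DBI::dbQuoteString(conn, unlist(spec$args)),
                collapse = ", ")
  DBI::SQL(paste0("FROM ", spec$fun, "(", files, ", ", args, ")"))
}

render_sql(new_csv_spec("iris.csv", delim = ";"), conn)
#> <SQL> FROM read_csv(['iris.csv'], delim = ';')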

We can also prototype in the duckplyr package, this avoids having to recompile duckdb too. Would you like to contribute?

krlmlr avatar Oct 17 '24 06:10 krlmlr

I'd be more than happy to contribute, but may be a bit out of my depth. I haven't used duckplyr before and am not sure how to work with lazy tables. In terms of API-ifying common duckdb functions, that seems possible.

I'm assuming the 'spec' object would define all the specifications of a potential table or file, and would then be passed on to other functions that read or write different formats, to or from tables/views, or translate it into an SQL query.

eli-daniels avatar Oct 17 '24 10:10 eli-daniels

Thanks. The spec (a list with a class?) would be specific to an ingestion function and capture the arguments we want to pass to that function; rendering creates a call (in SQL or relational) to that function.

krlmlr avatar Oct 17 '24 12:10 krlmlr