duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

add `temporary` parameter to `duckdb_read_csv()`

Open ThomasSoeiro opened this issue 10 months ago • 3 comments

It is currently not possible to write a temporary table using duckdb_read_csv() directly.

To fix this, can we expose the temporary parameter in duckdb_read_csv():

https://github.com/duckdb/duckdb-r/blob/b13d4050e3096c38a699719a2b5e89fc65106765/R/csv.R#L32-L33

and pass it to the call to dbWriteTable()?

https://github.com/duckdb/duckdb-r/blob/b13d4050e3096c38a699719a2b5e89fc65106765/R/csv.R#L70


For tables that fit in memory, we can use:

mytable <- read.csv2("mytable.csv")
dbWriteTable(con, "mytable", mytable, temporary = TRUE)

But we are stuck for tables that don't fit in memory.

ThomasSoeiro avatar Apr 22 '24 10:04 ThomasSoeiro

Thanks.

  • The dbWriteTable() you mentioned really should be a dbCreateTable(), this had confused me a bit
  • Can you confirm that running dbExecute("COPY ... FROM ...") does what you expect it to for temporary tables? What happens with name clashes (same table exists as permanent and as temporary table)?
  • For now, you could work around:
    • Import the CSV data into a temporary database that you connect to your main database with ATTACH
    • Use something like CREATE TEMPORARY TABLE ... AS read_csv_auto('...') (not tested)

krlmlr avatar Apr 24 '24 05:04 krlmlr

Prepare files and db for testing
library(DBI)

write.csv(warpbreaks, "warpbreaks.csv", row.names = FALSE)
write.csv(cars, "cars.csv", row.names = FALSE)

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
duckdb::duckdb_read_csv(con, "warpbreaks1", "warpbreaks.csv")
dbListTables(con)
# [1] "warpbreaks1"
dbDisconnect(con, shutdown = TRUE)
  • Can you confirm that running dbExecute("COPY ... FROM ...") does what you expect it to for temporary tables? (...)

Yes, it does:

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks2 (breaks NUMERIC, wool VARCHAR, tension VARCHAR);")
# [1] 0
dbExecute(con, "COPY warpbreaks2 FROM 'warpbreaks.csv' (HEADER);")
# [1] 54
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks2"
dbDisconnect(con, shutdown = TRUE)

# check that the temporary table is gone
con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb")
dbListTables(con)
# [1] "warpbreaks1"
dbDisconnect(con, shutdown = TRUE)
  • (...) What happens with name clashes (same table exists as permanent and as temporary table)?

Not sure what is expected here. Here is a test:

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks1 (speed NUMERIC, dist NUMERIC);")
# [1] 0
dbExecute(con, "COPY warpbreaks1 FROM 'cars.csv' (HEADER);")
# [1] 50
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks1"
dbReadTable(con, "warpbreaks1") |> head()
#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
# 4     7   22
# 5     8   16
# 6     9   10
dbDisconnect(con, shutdown = TRUE)
  • For now, you could work around:

    • Import the CSV data into a temporary database that you connect to your main database with ATTACH
    • Use something like CREATE TEMPORARY TABLE ... AS read_csv_auto('...') (not tested)

It works, thanks! (documentation here)

con <- dbConnect(duckdb::duckdb(), dbdir = "test.duckdb", read_only = TRUE)
dbListTables(con)
# [1] "warpbreaks1"
dbExecute(con, "CREATE TEMPORARY TABLE warpbreaks3 AS FROM 'warpbreaks.csv';")
# [1] 54
dbListTables(con)
# [1] "warpbreaks1" "warpbreaks3"
dbDisconnect(con, shutdown = TRUE)

ThomasSoeiro avatar Apr 24 '24 11:04 ThomasSoeiro

Temporary tables take precedence indeed if they exist:

library(duckdb)
#> Loading required package: DBI
con <- dbConnect(duckdb())

writeLines("a\n1", "x.csv")

dbExecute(con, "CREATE TEMPORARY TABLE x (a INTEGER)")
#> [1] 0

dbExecute(con, "CREATE TABLE x (y VARCHAR)")
#> [1] 0

dbExecute(con, "COPY x FROM 'x.csv' (HEADER)")
#> [1] 1

dbReadTable(con, "x")
#>   a
#> 1 1

dbRemoveTable(con, "x", temporary = TRUE)

dbReadTable(con, "x")
#> [1] y
#> <0 rows> (or 0-length row.names)

Created on 2024-08-17 with reprex v2.1.0

We want to do the following here:

  • In duckdb_read_csv(), change dbWriteTable() to dbCreateTable()
  • Add an ellipsis after files to make sure arguments are named
  • Add a temporary = FALSE argument and modify the SQL as needed

My preferred workflow is to use dbplyr with tbl_function(), but I'm open to reviewing a PR.

krlmlr avatar Aug 17 '24 08:08 krlmlr