duckdb-r
Cannot read CSV or Parquet with dplyr::tbl
Hello, it seems that this simple operation is not working as shown in the manual: https://duckdb.org/docs/api/r#dbplyr
Here is my reprex for both formats:
library(duckdb)
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
arrow::write_parquet(mtcars, "~/mtcars.parquet")
dplyr::tbl(conn_duck, "~/mtcars.parquet")
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.parquet) q01 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}
write.csv(mtcars, "~/mtcars.csv")
dplyr::tbl(conn_duck, '~/mtcars.csv')
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.csv) q02 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}
Created on 2024-03-29 with reprex v2.1.0
Thanks. Can you please try path_expand("~/...") or normalizePath("~/...") in the dplyr::tbl() call?
@krlmlr I did try with absolute paths. It still didn't work. However, I found out that duckdb was expecting single quotes around the file paths. If I do sprintf("'%s'", filepath), it works.
library(duckdb)
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
file_pq <- path.expand("~/mtcars.parquet")
arrow::write_parquet(mtcars, file_pq)
dplyr::tbl(conn_duck, file_pq)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.parquet".*
#> FROM "C:/Users/user1/Documents/mtcars.parquet"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.parquet" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_pq))
#> # Source: SQL [?? x 11]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ more rows
file_csv <- path.expand("~/mtcars.csv")
write.csv(mtcars, file_csv)
dplyr::tbl(conn_duck, file_csv)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.csv".*
#> FROM "C:/Users/user1/Documents/mtcars.csv"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.csv" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_csv))
#> # Source: SQL [?? x 12]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#> column00 mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ more rows
Created on 2024-04-03 with reprex v2.1.0
Thanks, good catch! Do we want to add this to the documentation?
Wouldn't it make more sense to update the function, or add one, to handle these file paths?
This seems like a package design decision. What I'm understanding is that the package handles files in the current directory differently from files in other directories.
If a file in the current directory is provided, dplyr::tbl() automatically adds single quotes.
example: dplyr::tbl(conn_duck, "mtcars.parquet")
However, for absolute file paths elsewhere, the user needs to add single quotes manually.
example: dplyr::tbl(conn_duck, "'C:/Users/user1/Documents/mtcars.parquet'")
duckdb 0.10.1 has tbl_file(), which should work out of the box; I totally forgot about that. 🙃 No quotes should be needed with that function.
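For reference, a minimal sketch of what that could look like, reusing the connection and files from the reprex above (untested here; assumes duckdb >= 0.10.1):
library(duckdb)
conn_duck <- dbConnect(duckdb())
# tbl_file() takes the path as-is; no manual single quotes are needed
duckdb::tbl_file(conn_duck, path.expand("~/mtcars.parquet"))
duckdb::tbl_file(conn_duck, path.expand("~/mtcars.csv"))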
I was just writing to let you know that I didn't know about the tbl_file() function. Was there a reason why tbl_file() and tbl_query() needed to be created as separate functions instead of building the functionality into tbl()?
tbl() is designed to work with table names or queries. Parquet or CSV files are neither. The new functions help disambiguate this and be explicit about the intent.
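A short sketch of that distinction, reusing the connection from the earlier reprexes (mtcars_tbl is a made-up name for illustration):
# tbl() expects an identifier for something that already exists in the database,
# e.g. a table or a registered data frame
duckdb::duckdb_register(conn_duck, "mtcars_tbl", mtcars)
dplyr::tbl(conn_duck, "mtcars_tbl")
# tbl_file() makes the "read this file from disk" intent explicit
duckdb::tbl_file(conn_duck, path.expand("~/mtcars.parquet"))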
I ended up here as I was going to report a bug, but realised my issue is a duplicate of this one (i.e. I should be using tbl_file() instead of tbl()). Not sure if this reprex adds anything to the current discussion, but I'm adding it just in case it's helpful: it's hard to work out that the wrong function is the source of the problem here, as there are some circumstances in which it works and others in which it doesn't.
library(duckdb)
#> Loading required package: DBI
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
tf <- tempfile(fileext = ".csv")
readr::write_csv(mtcars, tf)
con <- dbConnect(duckdb())
# filter without duckdb
readr::read_csv(tf) |>
filter(hp > 250)
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
# raises an error
tbl(con, tf) |>
filter(hp > 250) |>
collect()
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv".*
#> FROM "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv"
#> WHERE (hp > 250.0)
#> Error: Binder Error: Referenced table "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv" not found!
#> Candidate tables: "file7c7e83242c8f2"
# works when we do a no-op mutate before the filter
tbl(con, tf) |>
mutate(hp = hp + 0) |>
filter(hp > 250) |>
collect()
#> # A tibble: 2 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
#> 2 15 8 301 335 3.54 3.57 14.6 0 1 5 8
See also #38
I would recommend using something like the standard DuckDB read_csv() function:
tbl(con, "read_csv('mtcars.csv')")
Once we understand that this string is embedded in SQL, we can immediately see that other functions such as delta_scan() can be used here as well.
Thanks. I renamed tbl_query() to tbl_function() and added more examples to the docs.
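For anyone arriving here later, a hedged sketch of how that reads after the rename (file paths are placeholders; assumes duckdb >= 0.10.1):
# tbl_function() wraps a DuckDB table-generating function call embedded in SQL,
# so read_csv(), read_parquet(), delta_scan(), etc. can all be used the same way
duckdb::tbl_function(con, "read_csv('mtcars.csv')")
duckdb::tbl_function(con, "read_parquet('mtcars.parquet')")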