duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

Cannot read CSV or Parquet with dplyr::tbl

Open cy-james-lee opened this issue 10 months ago • 7 comments

Hello, it seems that the simple operation is not working as shown on the manual: https://duckdb.org/docs/api/r#dbplyr

Here is my reprex for both formats:

library(duckdb)
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
arrow::write_parquet(mtcars, "~/mtcars.parquet")
dplyr::tbl(conn_duck, "~/mtcars.parquet")
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.parquet) q01 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

write.csv(mtcars, "~/mtcars.csv")
dplyr::tbl(conn_duck, '~/mtcars.csv')
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM ~/mtcars.csv) q02 WHERE (0 = 1)
#> Caused by error:
#> ! {"exception_type":"Parser","exception_message":"syntax error at or near \"~/\"","position":"20","error_subtype":"SYNTAX_ERROR"}

Created on 2024-03-29 with reprex v2.1.0

cy-james-lee avatar Mar 29 '24 17:03 cy-james-lee

Thanks. Can you please try path_expand("~/...") or normalizePath("~/...") in the dpyr::tbl() call?

krlmlr avatar Mar 29 '24 22:03 krlmlr

@krlmlr I did try with absolute paths. It still didn't work. However, I found out that duckdb was expecting single quote around the file paths. If I do sprintf("'%s'", filepath), it works.

library(duckdb)
#> Loading required package: DBI
#> Loading required package: DBI
conn_duck <- dbConnect(duckdb())
file_pq <- path.expand("~/mtcars.parquet")
arrow::write_parquet(mtcars, file_pq)
dplyr::tbl(conn_duck, file_pq)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.parquet".*
#> FROM "C:/Users/user1/Documents/mtcars.parquet"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.parquet" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_pq))
#> # Source:   SQL [?? x 11]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

file_csv <- path.expand("~/mtcars.csv")
write.csv(mtcars, file_csv)
dplyr::tbl(conn_duck, file_csv)
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "C:/Users/user1/Documents/mtcars.csv".*
#> FROM "C:/Users/user1/Documents/mtcars.csv"
#> LIMIT 11
#> Error: Binder Error: Referenced table "C:/Users/user1/Documents/mtcars.csv" not found!
#> Candidate tables: "mtcars"
dplyr::tbl(conn_duck, sprintf("'%s'", file_csv))
#> # Source:   SQL [?? x 12]
#> # Database: DuckDB v0.10.0 [user1@Windows 10 x64:R 4.3.3/:memory:]
#>    column00      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ more rows

Created on 2024-04-03 with reprex v2.1.0

cy-james-lee avatar Apr 03 '24 14:04 cy-james-lee

Thanks, good catch! Do we want to add this to the documentation?

krlmlr avatar Apr 05 '24 13:04 krlmlr

Wouldn't it make more sense to update/add function to handle these file paths?

This seems like a package design decision. What I'm understanding is the package handles files in the same directory differently than other directories.

If a file in the same directory is provided, dplyr::tbl automatically adds single quotes. example: dplyr::tbl(conn_duck, "mtcars.parquet")

However, for other absolute file paths, the user needs to manually add single quotes. example: dplyr::tbl(conn_duck, "'C:/Users/user1/Documents/mtcars.parquet'")

cy-james-lee avatar Apr 05 '24 14:04 cy-james-lee

duckdb 0.10.1 has tbl_file() that should work out of the box, totally forgot about that. 🙃

No quotes should be needed with that function.

krlmlr avatar Apr 05 '24 14:04 krlmlr

I was just writing to let you know that I didn't know about the tbl_file function. Was there a reason why tbl_file and tbl_query functions needed to be created separately instead of implementing them into tbl?

cy-james-lee avatar Apr 05 '24 14:04 cy-james-lee

tbl() is designed to work with table names or queries. Parquet or CSV files are neither. The new functions help disambiguate this and be explicit about the intent.

krlmlr avatar Apr 05 '24 14:04 krlmlr

I ended up here as I was going to report a bug but realised my issue is a duplicate of this one (i.e. I should be using tbl_file() instead of tbl()). Not sure if this reprex adds anything to the current discussion but adding it just in case it's helpful - it's hard to work out that the wrong function is the source of the problem here, as there are some circumstances in which it works and others in which it doesn't.

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

tf <- tempfile(fileext = ".csv")
readr::write_csv(mtcars, tf)

con <- dbConnect(duckdb())

# filter without duckdb
readr::read_csv(tf) |>
  filter(hp > 250)
#> Rows: 32 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

# raises an error
tbl(con, tf) |>
  filter(hp > 250) |>
  collect()
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error:
#> ! rapi_prepare: Failed to prepare query SELECT "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv".*
#> FROM "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv"
#> WHERE (hp > 250.0)
#> Error: Binder Error: Referenced table "/tmp/Rtmp8d9qdv/file7c7e83242c8f2.csv" not found!
#> Candidate tables: "file7c7e83242c8f2"

# works when we do a no-op mutate before the filter
tbl(con, tf) |>
  mutate(hp = hp + 0) |>
  filter(hp > 250) |>
  collect()
#> # A tibble: 2 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  15.8     8   351   264  4.22  3.17  14.5     0     1     5     4
#> 2  15       8   301   335  3.54  3.57  14.6     0     1     5     8

thisisnic avatar Jun 09 '24 09:06 thisisnic

See also #38

I would recommend using something like the read_csv() function based on the standard DuckDB functions.

tbl(con, "read_csv('mtcars.csv')")

So, if we understand that this is embedded in SQL, we can immediately imagine that other functions such as delta_scan() can be used here as well.

eitsupi avatar Jun 11 '24 16:06 eitsupi

Thanks. I renamed tbl_query() to tbl_function() and added more examples to the docs.

krlmlr avatar Aug 16 '24 20:08 krlmlr