duckdb-r icon indicating copy to clipboard operation
duckdb-r copied to clipboard

Revisit `tbl.duckdb_connection()`

Open krlmlr opened this issue 1 year ago • 3 comments

I don't understand the code and the test:

https://github.com/duckdb/duckdb-r/blob/aef914bf34f4ff7dd17b3c6f5b97586f04a9c93f/R/backend-dbplyr__duckdb_connection.R#L370-L379

https://github.com/duckdb/duckdb-r/blob/aef914bf34f4ff7dd17b3c6f5b97586f04a9c93f/tests/testthat/test_tbl__duckdb_connection.R#L30-L43

I have the following questions:

  • What is this object cache for parquet files?
  • We're only executing the PRAGMA with cache = TRUE, is this desired? Should we also run it with cache = FALSE?
  • Does the PRAGMA give a permanent side effect on the connection (yes according to the test)? Should we reset it afterward?
  • To my understanding, the use of ident_q() is discouraged and also has no effect here because we're using NextMethod() . What's the intention here, do we need it?
  • Extra arguments to tbl() seem to be broken via https://github.com/tidyverse/dbplyr/issues/1384, should we perhaps offer a different, more robust API? We could implement a dedicated function that documents all available pragmas and perhaps even offers them as autocomplete.

CC @mgirlich + @hannes.

krlmlr avatar Oct 27 '23 14:10 krlmlr

I implemented this one, so a few comments:

Object cache related things can be dropped, those are remains from the earlier version that was intended for enabling tbl to work with parquet files. In any case, object cache for parquet is a global option for DuckDB connection that caches metadata during the first read of parquet and can make queries much quicker on some cases. As cache works only on subsequent queries (i.e. it caches the metadata during the first query), it should not be switched off. Moreover, this function is not a place for turning it off anyhow as it is a global option and not a file specific one. The possibility to turn it on here is just a convenience function as then you don't need to deal with DuckDB settings manually from the R side (if you forgot to turn it on when the DuckDB driver [yes in DuckDB DBI-driver allows parameters to be set, not actual connection] was initialized).

Feel free to change the ident_q, but beware of breaking the core functionality here. The magic happens here on the DuckDB parser, so getting quotes correctly through dbplyr pipeline (quotes only if table exists, otherwise no quotes so that DuckDB can infer the "table" to be a function or start a replacement scan based on file extension) here is essential.

I would say that this function that allows to easily connect tbl to DuckDB supported files or DuckDB functions is one of the most important functionalities required by the end-users, so breaking this would be a disaster.

rsund avatar Nov 01 '23 20:11 rsund

Related: #38.

krlmlr avatar Dec 02 '23 10:12 krlmlr

We now have tbl_file() and tbl_query(), are we good here?

krlmlr avatar Apr 24 '24 07:04 krlmlr

Please open a new issue if needed.

krlmlr avatar Aug 17 '24 08:08 krlmlr