duckdb-r
duckdb-r copied to clipboard
Revisit `tbl.duckdb_connection()`
I don't understand the code and the test:
https://github.com/duckdb/duckdb-r/blob/aef914bf34f4ff7dd17b3c6f5b97586f04a9c93f/R/backend-dbplyr__duckdb_connection.R#L370-L379
https://github.com/duckdb/duckdb-r/blob/aef914bf34f4ff7dd17b3c6f5b97586f04a9c93f/tests/testthat/test_tbl__duckdb_connection.R#L30-L43
I have the following questions:
- What is this object cache for parquet files?
- We're only executing the
PRAGMA
withcache = TRUE
, is this desired? Should we also run it withcache = FALSE
? - Does the
PRAGMA
give a permanent side effect on the connection (yes according to the test)? Should we reset it afterward? - To my understanding, the use of
ident_q()
is discouraged and also has no effect here because we're usingNextMethod()
. What's the intention here, do we need it? - Extra arguments to
tbl()
seem to be broken via https://github.com/tidyverse/dbplyr/issues/1384, should we perhaps offer a different, more robust API? We could implement a dedicated function that documents all available pragmas and perhaps even offers them as autocomplete.
CC @mgirlich + @hannes.
I implemented this one, so a few comments:
Object cache related things can be dropped, those are remains from the earlier version that was intended for enabling tbl to work with parquet files. In any case, object cache for parquet is a global option for DuckDB connection that caches metadata during the first read of parquet and can make queries much quicker on some cases. As cache works only on subsequent queries (i.e. it caches the metadata during the first query), it should not be switched off. Moreover, this function is not a place for turning it off anyhow as it is a global option and not a file specific one. The possibility to turn it on here is just a convenience function as then you don't need to deal with DuckDB settings manually from the R side (if you forgot to turn it on when the DuckDB driver [yes in DuckDB DBI-driver allows parameters to be set, not actual connection] was initialized).
Feel free to change the ident_q, but beware of breaking the core functionality here. The magic happens here on the DuckDB parser, so getting quotes correctly through dbplyr pipeline (quotes only if table exists, otherwise no quotes so that DuckDB can infer the "table" to be a function or start a replacement scan based on file extension) here is essential.
I would say that this function that allows to easily connect tbl to DuckDB supported files or DuckDB functions is one of the most important functionalities required by the end-users, so breaking this would be a disaster.
Related: #38.
We now have tbl_file()
and tbl_query()
, are we good here?
Please open a new issue if needed.