dbt-duckdb
dbt-duckdb copied to clipboard
Is it possible to reference tables as parquet files in iceberg on glue via external source?
I see a method to write to glue, but no reference to use it to get the data back out?
Please help!
Mmm I've started working on iceberg as a source using an AWS catalog here via a plugin mechanism: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/iceberg.py and then used here like this: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py
...but it's still pretty early/experimental. Can you tell me a bit more about what you're up to, and I'll see if I can figure out a way to help?
Yes we have a db schema we pipe to glue that we want to do some testing/analysis on.
The db updates every few hours with an iceberg merge, so we need just the latest, to pipe into duck, and we want to then run dbt transformations on it.
Okay-- so in theory, this will work:
- pip3 install pyiceberg[glue,s3fs,pyarrow]
- pip3 install git+https://github.com/jwills/dbt-duckdb.git (b/c I haven't cut a release with this functionality enabled yet)
- Setup the environment variables for pyiceberg to use as specified here: https://py.iceberg.apache.org/configuration/
- Configure the dbt
profiles.ymlwith the iceberg plugin enabled, which would look something like this:
jaffle_shop:
target: dev
outputs:
dev:
type: duckdb
path: jaffle_shop.duckdb
threads: 4
plugins:
- module: iceberg
config:
catalog: default
- and then finally define a
sources:config entry in a dbtschema.ymlfile that references the iceberg plugin and specifies the catalog/name of the table to load in theiceberg_tableconfig property sort of like it's done here: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py#L8
...and yeah, in theory, that should load the latest iceberg snapshot in its entirety as a pyarrow table into DuckDB for subsequent transformation by the rest of the pipeline. (But again, this is super-new, I haven't tried it for real, maybe take a look at dbt-athena and see if that meets your needs, etc., etc.)