
Is it possible to reference tables as parquet files in iceberg on glue via external source?

dioptre opened this issue 2 years ago • 3 comments

I see a method to write to Glue, but no way to use it to get the data back out?

Please help!

dioptre avatar Jun 13 '23 21:06 dioptre

Mmm I've started working on iceberg as a source using an AWS catalog here via a plugin mechanism: https://github.com/jwills/dbt-duckdb/blob/master/dbt/adapters/duckdb/plugins/iceberg.py and then used here like this: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py

...but it's still pretty early/experimental. Can you tell me a bit more about what you're up to, and I'll see if I can figure out a way to help?

jwills avatar Jun 13 '23 21:06 jwills

Yes, we have a DB schema that we pipe to Glue, and we want to do some testing/analysis on it.

The DB updates every few hours with an Iceberg merge, so we need just the latest snapshot piped into DuckDB, and we then want to run dbt transformations on it.

dioptre avatar Jun 13 '23 21:06 dioptre

Okay, so in theory this will work:

  1. pip3 install 'pyiceberg[glue,s3fs,pyarrow]' (quoted so the brackets survive shells like zsh)
  2. pip3 install git+https://github.com/jwills/dbt-duckdb.git (b/c I haven't cut a release with this functionality enabled yet)
  3. Set up the environment variables for pyiceberg as specified here: https://py.iceberg.apache.org/configuration/
  4. Configure the dbt profiles.yml with the iceberg plugin enabled, which would look something like this:
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb
      threads: 4
      plugins:
        - module: iceberg
          config:
            catalog: default
  5. and then finally define a sources: config entry in a dbt schema.yml file that references the iceberg plugin and specifies the catalog/name of the table to load in the iceberg_table config property, sort of like it's done here: https://github.com/jwills/dbt-duckdb/blob/master/tests/functional/plugins/test_iceberg.py#L8
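To make that last step concrete, here is a rough sketch of what the schema.yml entry might look like. The source name, schema, and table identifier below are invented for illustration, and the exact meta keys may differ; the linked test file is the authoritative example:

```yaml
sources:
  - name: iceberg_tables        # hypothetical source name
    schema: main
    meta:
      plugin: iceberg           # matches the plugin module configured in profiles.yml
    tables:
      - name: orders            # name the relation will have inside DuckDB
        meta:
          iceberg_table: "my_glue_db.orders"  # catalog identifier to load (made up here)
```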

...and yeah, in theory, that should load the latest iceberg snapshot in its entirety as a pyarrow table into DuckDB for subsequent transformation by the rest of the pipeline. (But again, this is super-new, I haven't tried it for real, maybe take a look at dbt-athena and see if that meets your needs, etc., etc.)

jwills avatar Jun 13 '23 21:06 jwills