dbt-duckdb icon indicating copy to clipboard operation
dbt-duckdb copied to clipboard

S3 Parquet Support

Open exemaitch opened this issue 3 years ago • 5 comments

Hi There!

DuckDB has the support to load Parquet files from S3. Are there any plans on supporting the ability to query from S3 Parquet files from dbt-duckdb? Would love to help in any way I can!

exemaitch avatar Aug 10 '22 06:08 exemaitch

Hey @exemaitch! That would be very cool, let me see if I can cook something up in a branch right quick...

jwills avatar Aug 12 '22 23:08 jwills

Thinking about something like this, which I put on a branch off of where I was working on the changes for dbt 1.2.0: https://github.com/jwills/dbt-duckdb/commit/e0ef672899e0c92269753c2b4690d02369af09d2

what do you think?

jwills avatar Aug 12 '22 23:08 jwills

Hi @jwills! Thank you!

I went ahead and tested your branch and received the following error:

19:30:58  
19:30:58  Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.55 seconds (0.55s).
19:30:58  
19:30:58  Completed with 1 error and 0 warnings:
19:30:58  
19:30:58  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
19:30:58    Unable to connect to URL "s3://test-dbt-duckdb/example.parquet": 403 (Forbidden)
19:30:58  
19:30:58  Done. PASS=0 WARN=0 ERROR=1 SKIP=1 TOTAL=2
make: *** [Makefile:3: dbt-run] Error 1

Here is my setup:

  1. I created a test repo with an example dbt project created with dbt init and your branch of dbt-duckdb installed.
  2. My profile.yml looks like this and the env vars have been set properly:
test_dbt_duckdb:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'foo.duckdb'
      #optional fields
      schema: main
      s3_region: 'us-east-1'
      s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID')}}"
      s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY' )}}"
  1. My model looks like this:
{{ config(materialized='table') }}

with source_data as (

    SELECT * FROM read_parquet('s3://test-dbt-duckdb/example.parquet')
)

select *
from source_data
  1. When I set a breakpoint() around here, I can see that the credentials variable is set properly with my AWS credentials.
  2. I wrote a sample script as a sanity test that can properly pull data from my example parquet file so I have verified that I can pull data from the parquet file via DuckDB with the following code (that very much looks like yours!):
def read_parquet_from_s3():
    s3_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
    s3_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
    con = duckdb.connect(database=":memory:", read_only=False)
    con.execute("INSTALL 'httpfs'")
    con.execute("LOAD 'httpfs'")
    con.execute("SET s3_region='us-east-1'")
    con.execute(f"SET s3_access_key_id='{s3_access_key_id}'")
    con.execute(f"SET s3_secret_access_key='{s3_secret_access_key}'")
    results = con.execute("SELECT * FROM read_parquet('s3://test-dbt-duckdb/example.parquet')").fetchall()
    print(results)

All in all - I think the code in your branch is very close but even though the credentials appear to be set properly, either they are not being set properly or perhaps DuckDB is not using the same connection. I am going to spend some more time troubleshooting - if you have any ideas from here - would love to hear them!

Thank you again for your branch! 👏🏽

exemaitch avatar Aug 14 '22 19:08 exemaitch

That is a dream report @exemaitch, thank you so much.

Based on what you saw here, I think the issue is as you said-- the cursor (aka PyConnection) that DuckDB is using to do the dbt operations is different from the one that I'm initializing in the open block, and apparently the settings aren't getting persisted across the different connections. I just pushed another commit https://github.com/jwills/dbt-duckdb/commit/5edec53eb90e56cb91125e9e49d39ded1be33f9a that hopefully fixes it if you have time to give it another shot?

jwills avatar Aug 14 '22 22:08 jwills

Hi @jwills - thank you for the commit! I am still seeing the same error:

22:03:20  Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
22:03:20    Unable to connect to URL "s3://test-dbt-duckdb/example.parquet": 403 (Forbidden)

I fired up pdb again in a cursory (haha) fashion and saw no glaring indications of what the issue may be. I am going to spend some more time troubleshooting and will report back!

exemaitch avatar Aug 15 '22 22:08 exemaitch

@exemaitch sorry it took me so long to get back to this, it's been a rough couple of months. ;-)

This should work now as of dbt-duckdb 1.2.0, which is up on pypi https://pypi.org/project/dbt-duckdb/

Please LMK if you have any issues with it!

jwills avatar Sep 23 '22 23:09 jwills