S3 Parquet Support
Hi There!
DuckDB has the support to load Parquet files from S3. Are there any plans on supporting the ability to query from S3 Parquet files from dbt-duckdb? Would love to help in any way I can!
Hey @exemaitch! That would be very cool, let me see if I can cook something up in a branch right quick...
Thinking about something like this, which I put on a branch off of where I was working on the changes for dbt 1.2.0: https://github.com/jwills/dbt-duckdb/commit/e0ef672899e0c92269753c2b4690d02369af09d2
what do you think?
Hi @jwills! Thank you!
I went ahead and tested your branch and received the following error:
19:30:58
19:30:58 Finished running 1 table model, 1 view model in 0 hours 0 minutes and 0.55 seconds (0.55s).
19:30:58
19:30:58 Completed with 1 error and 0 warnings:
19:30:58
19:30:58 Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
19:30:58 Unable to connect to URL "s3://test-dbt-duckdb/example.parquet": 403 (Forbidden)
19:30:58
19:30:58 Done. PASS=0 WARN=0 ERROR=1 SKIP=1 TOTAL=2
make: *** [Makefile:3: dbt-run] Error 1
Here is my setup:
- I created a test repo with an example dbt project created with
dbt initand your branch of dbt-duckdb installed. - My profile.yml looks like this and the env vars have been set properly:
test_dbt_duckdb:
target: dev
outputs:
dev:
type: duckdb
path: 'foo.duckdb'
#optional fields
schema: main
s3_region: 'us-east-1'
s3_access_key_id: "{{ env_var('AWS_ACCESS_KEY_ID')}}"
s3_secret_access_key: "{{ env_var('AWS_SECRET_ACCESS_KEY' )}}"
- My model looks like this:
{{ config(materialized='table') }}
with source_data as (
SELECT * FROM read_parquet('s3://test-dbt-duckdb/example.parquet')
)
select *
from source_data
- When I set a
breakpoint()around here, I can see that thecredentialsvariable is set properly with my AWS credentials. - I wrote a sample script as a sanity test that can properly pull data from my example parquet file so I have verified that I can pull data from the parquet file via DuckDB with the following code (that very much looks like yours!):
def read_parquet_from_s3():
s3_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
s3_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
con = duckdb.connect(database=":memory:", read_only=False)
con.execute("INSTALL 'httpfs'")
con.execute("LOAD 'httpfs'")
con.execute("SET s3_region='us-east-1'")
con.execute(f"SET s3_access_key_id='{s3_access_key_id}'")
con.execute(f"SET s3_secret_access_key='{s3_secret_access_key}'")
results = con.execute("SELECT * FROM read_parquet('s3://test-dbt-duckdb/example.parquet')").fetchall()
print(results)
All in all - I think the code in your branch is very close but even though the credentials appear to be set properly, either they are not being set properly or perhaps DuckDB is not using the same connection. I am going to spend some more time troubleshooting - if you have any ideas from here - would love to hear them!
Thank you again for your branch! 👏🏽
That is a dream report @exemaitch, thank you so much.
Based on what you saw here, I think the issue is as you said-- the cursor (aka PyConnection) that DuckDB is using to do the dbt operations is different from the one that I'm initializing in the open block, and apparently the settings aren't getting persisted across the different connections. I just pushed another commit https://github.com/jwills/dbt-duckdb/commit/5edec53eb90e56cb91125e9e49d39ded1be33f9a that hopefully fixes it if you have time to give it another shot?
Hi @jwills - thank you for the commit! I am still seeing the same error:
22:03:20 Runtime Error in model my_first_dbt_model (models/example/my_first_dbt_model.sql)
22:03:20 Unable to connect to URL "s3://test-dbt-duckdb/example.parquet": 403 (Forbidden)
I fired up pdb again in a cursory (haha) fashion and saw no glaring indications of what the issue may be. I am going to spend some more time troubleshooting and will report back!
@exemaitch sorry it took me so long to get back to this, it's been a rough couple of months. ;-)
This should work now as of dbt-duckdb 1.2.0, which is up on pypi https://pypi.org/project/dbt-duckdb/
Please LMK if you have any issues with it!