dbt-duckdb
Cannot read external file from S3 private bucket
- Create a private bucket in AWS S3 and set the environment variables.
- Run `dbt run -s ./models`.
- Got this error:
  `Python model failed: HTTP Error: HTTP GET error on 'https://bucket.s3.amazonaws.com/ms/distribution/csv/orders.csv' (HTTP 403)`
- My code:
profiles.yml:

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: db/dbt.duckdb
      extensions:
        - httpfs
        - parquet
        - aws
      settings:
        s3_region: "us-east-1"
        s3_access_key_id: "{{ env_var('S3_KEY') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET') }}"
  target: dev
```
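As I understand it, dbt-duckdb applies these `settings` entries to the DuckDB connection, roughly equivalent to running the following statements (my reading of the docs, not verified against the adapter source; credentials redacted):

```sql
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';      -- value of $S3_KEY, redacted
SET s3_secret_access_key = '...';  -- value of $S3_SECRET, redacted
```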
sources.yml:

```yaml
sources:
  - name: csv
    meta:
      external_location: "s3://bucket/ms/distribution/csv/{name}.csv"
    tables:
      - name: orders
      - name: customers
```
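The 403 URL in the error matches what I expect the source to resolve to: dbt-duckdb substitutes the table name into `external_location`, so `dbt.source("csv", "orders")` should end up reading something like (illustrative):

```sql
SELECT * FROM 's3://bucket/ms/distribution/csv/orders.csv';
```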
models.py:

```python
def model(dbt, session):
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders
```
- Some findings: if I persist the S3 secret to the DuckDB home folder ("~/.duckdb/stored_secrets/"), then `dbt run` works without error. It seems that dbt-duckdb did not actually feed the S3 credentials to DuckDB.
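A persistent secret of that kind is created with something like the following (a sketch; the secret name is illustrative and the credentials are redacted):

```sql
-- Run once in a DuckDB session; the secret is written to ~/.duckdb/stored_secrets/
CREATE PERSISTENT SECRET my_s3_secret (
    TYPE S3,
    KEY_ID '...',   -- redacted
    SECRET '...',   -- redacted
    REGION 'us-east-1'
);
```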
It would be surprising if the problem were that we aren't feeding the S3 credentials to DuckDB, given how many people run with those settings (that is, I feel like I would have heard that complaint from lots of people if it were broken).
The most common problem I see with these settings is that the region isn't configured correctly (i.e., the data lives in an S3 bucket whose region is different from the one defined in `s3_region`). Of course, if you have a way to make this work using the CREATE SECRET construct (which it sounds like you do), you can just run that in a macro configured via the `on-run-start` hook, as sketched below, and be on your way.
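A minimal version of that hook might look like this (a sketch, assuming the same S3_KEY/S3_SECRET environment variables and region from the profile above; untested):

```yaml
# dbt_project.yml
on-run-start:
  - >
    CREATE OR REPLACE SECRET s3_secret (
      TYPE S3,
      KEY_ID '{{ env_var("S3_KEY") }}',
      SECRET '{{ env_var("S3_SECRET") }}',
      REGION 'us-east-1'
    )
```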
@jwills Thanks for your reply. I double-checked my `s3_region` and I am pretty sure the value is correct. I also did some testing: adding the config below to write files to the S3 bucket worked without the CREATE SECRET construct, which proves my S3 settings are correct. I also don't believe DuckDB has a problem reading files from S3; rather, something in the dbt-duckdb adapter may be leaving the S3 settings improperly configured on DuckDB, which breaks reading from the bucket. Below is the config I used to write a table to external files on the S3 bucket. Again, writing worked without the CREATE SECRET construct, but reading did not.
models.py:
```python
def model(dbt, session):
    dbt.config(
        materialized="external",
        location="s3://bucket/parquet/duckdb-sample/pending_orders.parquet",
        format="parquet",
        options={"partition_by": "customer_id", "overwrite_or_ignore": True},
        glue_register=True,
        glue_database="duckdb_sample",
    )
    orders = (
        dbt.source("csv", "orders").filter("status='PENDING_PAYMENT'").set_alias("o")
    )
    customers = dbt.source("csv", "customers").set_alias("c")
    pending_orders = orders.join(customers, "o.customer_id=c.id")
    return pending_orders
```
I stumbled upon the same issue for some time; not sure if this also fixes it for you (and I realise it's a few months later). However, simply rewriting the `settings` in your profiles.yml file to:
```yaml
secrets:
  - type: s3
    region:
    key_id:
    secret:
```
did the trick for me!
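For completeness, wired up to the environment variables from the original profile, the dev output might look like this (a sketch; untested):

```yaml
default:
  outputs:
    dev:
      type: duckdb
      path: db/dbt.duckdb
      extensions:
        - httpfs
      secrets:
        - type: s3
          region: "us-east-1"
          key_id: "{{ env_var('S3_KEY') }}"
          secret: "{{ env_var('S3_SECRET') }}"
  target: dev
```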