
External materialized model with `per_thread_output` parameter does not work

Open elephantmetropolis opened this issue 1 year ago • 8 comments

Hello, I am running into an issue when using the external materialization with the per_thread_output option.

This option is supposed to create a number of files based on the number of threads used by DuckDB.
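
To illustrate what I mean (paths here are just placeholders), a plain DuckDB COPY with this option writes a directory of files rather than a single file:

copy (select 42 as x) to 's3://my_bucket/my_path'
    (format parquet, per_thread_output true);
-- writes s3://my_bucket/my_path/data_0.parquet, data_1.parquet, ... (roughly one file per thread)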

When this parameter is passed along with the external materialization, I receive an error:

HTTP Error: Unable to connect to URL "https://my.s3/my_bucket/my_path": 404 (Not Found)

The funny thing is that the files are actually written to S3 in the right place. And it is the current model that crashes, not the downstream one. For example, if I have model_b depending on model_a, the run goes as follows:

11:04:12  1 of 2 START sql external model main.model_a ............... [RUN]
11:04:22  1 of 2 ERROR creating sql external model main.model_a ...... [ERROR in 9.70s]
11:04:22  2 of 2 SKIP relation main.model_b ................... [SKIP]

After some research, I found this part of the code: https://github.com/duckdb/dbt-duckdb/blob/6b539a6f8192e0ec1d4e9a310f4a6b64f53d830b/dbt/adapters/duckdb/impl.py#L173. From what I understand, only models created with the partition_by parameter can use a file glob, and I suppose that's the problem here.
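
In other words (paths illustrative), I believe reading back the output of a per_thread_output write needs a glob over the directory, something like:

select * from read_parquet('s3://my_bucket/my_path/*.parquet');

rather than pointing at a single file.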

Is this assumption correct? If yes, is it possible to add support for the per_thread_output parameter, or would it be better to handle this in a more "parameter-agnostic" way?

I'm not an expert, but I'd be happy to contribute in any way I can.

Edit: I tried modifying impl.py and it seems to work on my side: https://github.com/duckdb/dbt-duckdb/pull/493

Edit 2: After more investigation, it seems the per_thread_output parameter is not working properly for external materializations. I tested writing to S3 using a post-hook and a macro, and this code works:

{% macro export_to_s3(table) %}
{% set s3_path = 's3://my_bucket' %}
    COPY {{ table }} 
    TO '{{ s3_path }}/{{ table.name }}'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);
{% endmacro %}
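
For reference, I call it roughly like this from a model config (the model and materialization here are just for illustration; the macro call goes in the post_hook string and dbt's this relation is passed in):

{{
    config(
        materialized='table',
        post_hook="{{ export_to_s3(this) }}"
    )
}}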

But using per_thread_output directly in {{ config() }} will end up with DuckDB writing one big file to S3.

For info, I set DuckDB to use 30 threads and 300 GB of RAM.

However, the following config doesn't work properly: only one file is created, although the file is placed inside a subfolder, which is consistent with the expected PER_THREAD_OUTPUT behavior:

{{
    config(
        materialized='external',
        format='parquet',
        options={
            "per_thread_output": 1
        }
    )
}}
...

It seems there are two issues here: the first is that per_thread_output is not taken into account for the file glob, and the second is that the per_thread_output parameter behaves oddly (only using one thread?).

elephantmetropolis avatar Dec 17 '24 12:12 elephantmetropolis

@elephantmetropolis thanks for the report here, this is a fun one-- taking a look

jwills avatar Dec 18 '24 00:12 jwills

Quick update on this issue regarding external materialization using only one thread.

My first (DuckDB-uneducated) assumption was that, due to some DuckDB optimization, the query planner decided to use only one thread to perform the COPY operation.

I decided to test with a very simple model. I use (DuckDB) threads = 40, memory_limit = '300GB' and (dbt) threads = 1:

{{ 
    config(
        materialized='external',
        format='parquet', 
        options={
            'per_thread_output': true
        }
    )
}}

select * 
from {{ ref('my_5GB_table') }} -- 'my_5GB_table' is materialized as a table

The behaviour is still the same: it writes one 5 GB file to S3 in 2 min 30 sec.

Now, if I connect to the DuckDB database (the same one used by dbt) with ./duckdb and run the following command (which is supposed to be the exact same command afaik):

copy (select * from my_5GB_table) to 's3://my_bucket/my_path' (format parquet, per_thread_output);

It writes 40 files to S3 in ~3 sec. When I check CPU usage with btop, I clearly see that the dbt run uses only 1 thread, whereas the manual run uses 40 threads.

I’ve checked everything I could in the implementation of the external materialization, but I cannot find why this behavior occurs.
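
For completeness, a quick sanity check that could be run through the same connection dbt uses (I haven't captured the output here) is to compare its settings against the CLI session:

select current_setting('threads') as threads,
       current_setting('memory_limit') as memory_limit;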

elephantmetropolis avatar Jan 09 '25 11:01 elephantmetropolis

Oh I've been there before-- the most likely explanation is that the Python version of DuckDB you're using with dbt isn't the same as the CLI version you're running locally here (i.e. the Python package is on an older version of DuckDB that doesn't support this feature yet).

The other possibility is that the version of dbt-duckdb you're using doesn't have the fix you included since I don't think we've cut a release since you merged it-- are you running off of your own build of dbt-duckdb, or did you do something like pip install --upgrade git+https://github.com/duckdb/dbt-duckdb.git@master to get the version with your fix?
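
For what it's worth, a quick way to compare the two is to run something like this both in the CLI and through the Python package that dbt uses:

PRAGMA version;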

jwills avatar Jan 09 '25 15:01 jwills

Thanks for the quick reply! I just checked, and both the CLI and Python DuckDB are on 1.1.3.

About dbt-duckdb, I am using a local build which includes my fix, and the fixed part works as expected. However, that fix only concerns the reading part of the external materialization. From what I saw, there was nothing to fix in the writing part.

Actually, this issue is not exactly related to the initial issue I opened; maybe I could open a new one to make things clearer.

elephantmetropolis avatar Jan 10 '25 07:01 elephantmetropolis

Ah okay I think I follow now, sorry about that. So I think the issue might be in how we are rendering the per_thread_output option, which will be based on the logic here: https://github.com/duckdb/dbt-duckdb/blob/master/dbt/adapters/duckdb/impl.py#L159

You might want to try special-casing it as an option there, like we do for e.g. partition_by and some of the other common options, to see if that fixes the issue?

jwills avatar Jan 10 '25 23:01 jwills

That's what I was thinking before, but the behavior is not the same when running with and without PER_THREAD_OUTPUT.

For example, when I set PER_THREAD_OUTPUT to true, it creates the file inside an S3 subfolder:

my_bucket/my_output.parquet/data_0.parquet

vs. when I run the exact same pipeline without PER_THREAD_OUTPUT, the folder structure is:

my_bucket/my_output.parquet, with my_output.parquet being the file containing the actual data.

Also, I ran the unit tests and checked the output, and PER_THREAD_OUTPUT is processed as expected by default.

I will still try the special-casing as you suggested and see if it helps, but I have my doubts.

elephantmetropolis avatar Jan 13 '25 10:01 elephantmetropolis

Hi @jwills, I know you have been busy recently, but have you had some time to check this issue? :) I don't know if it's a problem with dbt-duckdb though; it could be more on the duckdb side. Thanks!

elephantmetropolis avatar Feb 24 '25 12:02 elephantmetropolis

Ah so the special-casing trick via the options thing didn't work I assume, okay. I'm sorry I don't have a ton of time to hack on this thing given the constraint of having a non-DuckDB related real job, but let's see if @guenp or @matsonj has a moment to try this out

jwills avatar Feb 24 '25 19:02 jwills