External materialization with PER_THREAD_OUTPUT performs single-threaded write
I recently noticed that using external materialization with PER_THREAD_OUTPUT resulted in the same write performance as without PER_THREAD_OUTPUT.
I investigated this, and it seems that having the COPY ... operation in the same transaction block as the CREATE TABLE ... operation prevents DuckDB from using all available threads. For large files being written to S3, this leads to very poor write performance (we have 40 threads available and plenty of bandwidth, yet only one thread is used). I opened an issue on DuckDB to ask whether this behavior is "by design" or a bug. I tend to think it is "by design", even though I'm not sure I fully understand the implications of DuckDB's optimistic concurrency control.
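To make the diagnosis concrete, the pattern generated today looks roughly like the sketch below (the table name, source, and S3 path are made up for the example):

```sql
-- Hypothetical model table and bucket, only to illustrate the shape of the statements.
BEGIN TRANSACTION;

CREATE TABLE my_model__dbt_tmp AS SELECT * FROM source_data;

-- Running this COPY inside the same transaction as the CREATE TABLE above is what
-- seems to force the write down to a single thread, even with PER_THREAD_OUTPUT.
COPY my_model__dbt_tmp TO 's3://my-bucket/my_model'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

COMMIT;
```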
If this is not a bug, I wanted to know whether it would be possible to change the behavior of the external materialization. The idea would be to edit external.sql so that the first transaction ends before write_to_file(), and a new transaction is opened (or not) after write_to_file() for the view creation and cleanup.
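Roughly, the statement sequence under that proposal would look like the sketch below (names and paths are illustrative, not the actual external.sql code):

```sql
-- First transaction: build the model table, then commit before the external write.
BEGIN TRANSACTION;
CREATE TABLE my_model__dbt_tmp AS SELECT * FROM source_data;
COMMIT;

-- write_to_file() would then run outside an explicit transaction,
-- so the COPY is free to use all available threads.
COPY my_model__dbt_tmp TO 's3://my-bucket/my_model'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE);

-- Optional second transaction for the view creation and cleanup.
BEGIN TRANSACTION;
CREATE OR REPLACE VIEW my_model AS
    SELECT * FROM read_parquet('s3://my-bucket/my_model/*.parquet');
DROP TABLE my_model__dbt_tmp;
COMMIT;
```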
This would represent a shift from the current behavior, but it is a trade-off between transactional atomicity and performance. As it stands today, external materialization cannot realistically be used for large data, as the write overhead is too significant. We are currently making it work with a workaround based on macros and post-hooks, but it's not optimal.
I would greatly appreciate your input on this one @tomsej, as this part has not changed much since you started working on it. Also, my proposed solution is probably not optimal, and you may have a better idea for solving this issue.
FWIW, I would be very interested in seeing this solved; it's an important use case. What are the consequences of the proposed fix? That is, what will happen when we change the behavior here to run the COPY outside of the transaction blocks?
There's also a decent argument for doing an overhaul here given that DuckDB has evolved quite a bit since we first wrote this materialization + I bet we could simplify the impl significantly
Regarding consequences, from my understanding it will depend on how we decide to manage the transactions, but the most likely behavior change is that (if some issue arises during a run) we will have committed tables with the __dbt_tmp suffix lying around in the database until the next run takes care of them.
Another issue is managing existing files in S3, whether they were created during a successful run or committed before a run failed. With the PER_THREAD_OUTPUT option, the number of output files is non-deterministic: for the exact same run, you may end up with 30 or 31 files, for example (as discussed in this issue). There is no guarantee that run_2 will overwrite the files written by run_1. Also, the OVERWRITE option is not yet supported on remote filesystems (not just S3, it seems), and as of today we don't know if there is any plan to support it in the future (see this issue). Finally, we don't have much ability to manipulate S3 objects from within dbt macros.
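For context, what we would ideally want to run against S3, but currently cannot (assuming I have the option spelling right), is something like:

```sql
-- Illustrative only: OVERWRITE is rejected on remote filesystems today, so this
-- cannot be used to guarantee that run_2 replaces the files left behind by run_1.
COPY my_model__dbt_tmp TO 's3://my-bucket/my_model'
    (FORMAT PARQUET, PER_THREAD_OUTPUT TRUE, OVERWRITE TRUE);
```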
On our side, we use dbt macros and Airflow jobs to move the data between tmp and final paths to ensure synchronization and atomicity.
After investigation, the issue seems to be more complex than expected, but I would like to hear other inputs on it 😬 Also, I agree about the overhaul, since dbt has also changed how custom materializations are managed in the meantime.