dbt-duckdb
feat: delta plugin support write
Hi all, we are really interested in this feature. Is there a rough estimate of if and when it could be merged into the main branch? Thanks
Hi @nfoerster, it is currently very hard to give an end date, but I am working on it when I have free time. The problem goes beyond just adding a plugin: to make it work in the current context with the external materialization, there are things to refactor in other parts, and this branch can also break other functionality. What I am doing now is trying to understand the memory management that happens in the background in order to propose a change to the current implementation.
I also rebase on the main branch from time to time, and if you want, you can use this branch to try things out and provide feedback/use cases that are missing, but be aware that other things are possibly broken. Here is the example dbt project which I use while developing; there you can find some models which write data to Azure/local.
- Append and overwrite work (a minimal sketch of how these map onto the Delta writer follows this list)
- overwrite_partitions is in progress
- merge was very slow when I tried it back then, but it may be faster now
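Just to make the write modes above concrete, here is a minimal sketch assuming the deltalake Python package is the underlying writer and using a placeholder path (this is not the plugin's actual code):

```python
import duckdb
from deltalake import write_deltalake

con = duckdb.connect()
arrow_table = con.execute("SELECT 42 AS id, 'a' AS part").arrow()

target = "/tmp/example_delta_table"  # placeholder location

# "append" and "overwrite" correspond directly to deltalake write modes;
# overwrite_partitions and merge need extra handling and are not shown here.
write_deltalake(target, arrow_table, mode="append")
write_deltalake(target, arrow_table, mode="overwrite")
```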
I hope to write a small proposal in the next few days to refactor the current implementation so that we can implement this plugin and test it.
Hi @jwills, to make some progress on this PR, I will write up my current status so that you know what I am thinking and can provide me with some extra information/thoughts.
The main problem is that when we use the plugin export, we have an extra step: we export the data to a parquet file and then load it back into memory with pandas.
I think we have to hand a direct reference to the data in Arrow format (ideally an Arrow dataset) to the underlying export library (pandas, or datafusion in the delta case), if that is somehow possible.
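A minimal sketch of the difference, with placeholder paths and the deltalake package standing in for the export library (simplified, not the actual dbt-duckdb code):

```python
import duckdb
import pandas as pd
from deltalake import write_deltalake

con = duckdb.connect()

# Current path (simplified): materialize to parquet, re-read with pandas, then export.
con.execute("COPY (SELECT 1 AS id) TO '/tmp/model.parquet' (FORMAT PARQUET)")
df = pd.read_parquet("/tmp/model.parquet")  # extra I/O plus another in-memory copy
write_deltalake("/tmp/model_delta", df, mode="overwrite")

# Proposed path: hand the Arrow data from the DuckDB result straight to the exporter.
arrow_table = con.execute("SELECT 1 AS id").arrow()
write_deltalake("/tmp/model_delta", arrow_table, mode="overwrite")
```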
TL;DR: Current implementation, so what do we do now?
step 3. write to storage
step 5. sqlalchemy write / excel write
New implementation:
What is the problem right now?
- Looking into the arrow <-> duckdb connection, I found out that .arrow() makes a copy. That is clear to me, but the question is how to do it without a copy. Following https://duckdb.org/2021/12/03/duck-arrow.html, I started to question what happens with memory in the background and would like to profile the whole dbt-duckdb process. This is my repo with some examples: https://github.com/milicevica23/wtf-memory. If you know somebody who can help me understand how to transfer data from duckdb to arrow format with zero copies, I would appreciate that a lot. (A small sketch of the streaming idea follows after this list.)
Regarding other plugins, I took a look at the pandas <-> arrow combination (https://arrow.apache.org/docs/python/pandas.html), which is not ideal but can still be better than the parquet export.
- I have a hard time imagining how we can make such an export referenceable again in downstream models. What we do right now is export to a file and provide a view over that chunk of data, but this is more of a workaround than a solution. If we imagine the SQLAlchemy plugin and look at it from another perspective, what gets referenced is the last-run chunk of data that was pushed, not the source table in the database. This can be interesting if we export locally and to parquet, but pushing to some remote storage is harder to implement, e.g. https://github.com/l-mds/demo-dbt-duckdb-delta-plugin/issues/1
In my opinion, we should ask around whether somebody actually references something that is exported with a plugin, try to understand the use case, and try to find a solution for it. I know that, especially for the delta plugin, this feature would be nice to have, so I will keep thinking about it.
Nevertheless, I think a good architecture should not reference something that you export; if you want to use it in the following steps, it should be declared again in the source definition.
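On the copy question from the first bullet above, here is a small sketch using only plain duckdb and pyarrow APIs (nothing dbt-duckdb specific): stream the result as Arrow record batches instead of materializing one big table, and limit extra copies when a pandas DataFrame is really needed. Whether this is truly zero-copy end to end is exactly what the profiling should establish.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT range AS id FROM range(1000000)")

# Stream the result as Arrow record batches instead of one big .arrow() table,
# so an exporter can consume chunks without a fully materialized copy.
reader = con.execute("SELECT * FROM t").fetch_record_batch(100_000)
for batch in reader:
    pass  # hand each pyarrow.RecordBatch to the exporting library here

# If an exporter really needs pandas, pyarrow's conversion flags can at least
# limit the extra copies (these are standard pyarrow Table.to_pandas options).
table = con.execute("SELECT * FROM t").arrow()
df = table.to_pandas(split_blocks=True, self_destruct=True)
```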
Next steps:
- I would like us to comment on the above and then implement what we agree on. If we go with the Arrow integration, it should then also be implemented in the excel and sqlalchemy plugins, which have an export part
- We have to think about what this change means in terms of breaking changes, if there are any, and prepare for that
- Optional: I would like to copy the data as little as possible, so I would stay on track with the duckdb <-> arrow <-> pandas integration; I would appreciate any help here
Hey @milicevica23 apologies for the lag here, I missed this go by in my email. My limited understanding of the issue is that it sounds like the external approach-- where we write a parquet file out and then treat any subsequent references to that file as a view in DuckDB-- isn't the right thing to do for delta (and presumably Iceberg) b/c both of those tools expect to operate on something like an Arrow Table/RecordBatch instead of a set of parquet files-- is that correct, or am I missing something?
If that's the case, then I think that we shouldn't keep trying to jam the iceberg and delta stuff into the external materialization; we should just create delta and iceberg materialization strategies that are designed to work in the way that those tools need them to work. It's more work, but I don't think that trying to push the external concept beyond its design (especially if doing so would require us to break backwards compatibility) is a good idea, and I fully expect that the Iceberg/Delta table design patterns will be common enough to deserve first-class support in dbt-duckdb.
Hi @jwills, this problem is not specific to delta or iceberg but exists at the general plugin level. We have two use cases:
1. export files (parquet, csv) from the duckdb connection
2. use a plugin to export data to Excel, delta, and Iceberg from the duckdb connection
My understanding is that use case 1 was developed first, and use case 2 was built on top of it over time. This introduced a flawed process with one unnecessary step: we first export the data to a parquet file and then re-read it into memory, instead of handing the data directly from duckdb to the exporter in Arrow format. So this is conceptually wrong, but one nice feature that was luckily born from it is that you can reference exported models, with limitations, e.g. you can't do incremental models with plugins. This should instead be solved with data frame registration, as we already do in the read part of the plugin.
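To make the registration idea concrete, here is a rough sketch (not the plugin's actual read-path code) of how a written Delta table could be registered back on the DuckDB connection as an Arrow dataset, so downstream models reference that instead of a view over an exported parquet file; the path and name are placeholders and the deltalake package is assumed:

```python
import duckdb
from deltalake import DeltaTable

con = duckdb.connect()

# Register the previously written Delta table as an Arrow dataset on the
# connection, mirroring what the read part of the plugin does today.
dataset = DeltaTable("/tmp/model_delta").to_pyarrow_dataset()
con.register("my_delta_model", dataset)

print(con.execute("SELECT count(*) FROM my_delta_model").fetchall())
```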
I am still unsure what the best approach is, but we have to rethink and simplify the whole export part and combine use cases 1 and 2 above into a unified one-plugin approach. It is not easy, but I think it is a needed step for this feature.
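Purely as an illustration of what a unified approach could mean, here is a hypothetical sketch (the class and method names are invented, not the existing dbt-duckdb plugin API): every exporter, including the plain parquet/csv case, would implement one hook that receives the model's data as an Arrow stream.

```python
import pyarrow as pa


class ExportPlugin:
    """Hypothetical unified export hook; names here are illustrative only."""

    def store(self, relation_name: str, data: pa.RecordBatchReader, config: dict) -> None:
        # A parquet/csv exporter, the excel/sqlalchemy plugins, and the delta/iceberg
        # writers would all implement this one method against the same Arrow stream,
        # removing the intermediate parquet round-trip.
        raise NotImplementedError
```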