
Incremental external models

Open nfcampos opened this issue 2 years ago • 5 comments

Hi

Do you think it's feasible to add support for incremental materialisation to external locations? I.e., extending the external materialisation to support incremental semantics.

Namely, I was thinking of the simplest case: an "append" strategy implemented as just writing a new parquet file, with the model overall defined by a glob location.

Do you see any drawbacks to such an approach? I'm happy to submit a PR if you point me in the right direction in terms of what API you'd expect this to have. Should this be part of the external materialisation, part of the incremental one, or a completely new one?

Thanks Nuno

nfcampos avatar Dec 22 '22 22:12 nfcampos

Hey Nuno! I don't know if that is feasible; as I understand it, dbt only supports incremental materialization against the table materialization type. The only way to do it, I think, would be to unify the external materialization into the table materialization (so that an external materialization would become just a special case of a table materialization), but I don't have a great sense for how complex that would be to do.

jwills avatar Dec 24 '22 05:12 jwills

Hey @nfcampos, parquet doesn't (AFAIK) support any updates in place other than overwrite, which means we can't directly append into parquet files. I'm unsure about CSVs; at a guess, inserts might be possible, but I doubt updates would be.

Generally, this smells a lot like Iceberg / something similar. I remember when @tomsej and I were initially discussing external materializations (over in https://github.com/jwills/dbt-duckdb/issues/19), we decided not to make this adapter too file-management-y, and I personally think this pattern could quickly head in that direction.

jpmmcneill avatar Feb 01 '23 12:02 jpmmcneill

What I meant with parquet files was to add a new parquet file to the same directory, with the model defined using a glob; see DuckDB's support for this: https://duckdb.org/docs/data/parquet#multi-file-reads-and-globs

But I agree that this walks in the direction of things like Iceberg etc. If only Iceberg could be used in a more serverless way, i.e. without a catalog server/db running somewhere... Actually, maybe it can, with a JDBC catalog run from a SQLite db, but I haven't tried that 🤔 (the only mention of something like this I can find is https://github.com/apache/iceberg/issues/3245#issuecomment-1031730083)

nfcampos avatar Feb 01 '23 12:02 nfcampos

> I think what I meant with parquet files was to add a new parquet file to the same directory, with the model defined using a glob, see here on duckdb support for this https://duckdb.org/docs/data/parquet#multi-file-reads-and-globs
>
> But I agree that this walks in the direction of things like Iceberg etc. If only Iceberg could be used in a more serverless way, ie. without a catalog server/db running somewhere... Actually maybe it can, with a JDBC catalog ran from a sqlite db, but haven't tried that 🤔 (only mention of something like this I can find apache/iceberg#3245 (comment))

Keep an eye on https://github.com/sutoiku/puffin re serverless iceberg 😉 I'm very excited about this.

On globs: I've used them and they work fine. Extending the materializations with this would be cool. I'll probably spike this out over the weekend and follow up here 👍

jpmmcneill avatar Feb 01 '23 12:02 jpmmcneill

@jpmmcneill Ah, I did not know about Puffin, very interesting

nfcampos avatar Feb 01 '23 12:02 nfcampos