data-diff
Add support for DuckDB
DuckDB is an in-process database. You typically create it as a session, then discard it once you're done (though that's not the only way to use it).
It's awesome for a few reasons that apply to data-diff. Namely, you can query raw csv/txt/parquet files directly as though they were tables (e.g. select posting_date, count(*) as r_count from '/Users/me/data.csv' group by posting_date).
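For anyone who hasn't used it, here is a minimal sketch of that workflow from Python, assuming the `duckdb` package is installed (`pip install duckdb`). The helper names `build_count_query` and `count_by_posting_date` are made up for illustration and are not part of DuckDB or data-diff; the `posting_date` column is the one from the query above.

```python
def build_count_query(csv_path: str) -> str:
    """Build the example query against a raw CSV file path (hypothetical helper)."""
    # Double any single quotes so the path stays a valid SQL string literal.
    safe_path = csv_path.replace("'", "''")
    return (
        "select posting_date, count(*) as r_count "
        f"from '{safe_path}' "
        "group by posting_date"
    )


def count_by_posting_date(csv_path: str) -> list:
    """Run the query in a throwaway in-memory DuckDB session."""
    # Third-party dependency, imported here so the query builder
    # above can be used without DuckDB installed.
    import duckdb

    con = duckdb.connect()  # no file argument: an in-memory database
    try:
        return con.execute(build_count_query(csv_path)).fetchall()
    finally:
        con.close()  # discard the session once done
```

Calling `count_by_posting_date('/Users/me/data.csv')` should return one `(posting_date, r_count)` tuple per distinct date, with DuckDB inferring the column types from the file.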
We use this ability to load PROD v UAT files from our system to compare output. Being able to pass this across to data-diff would be incredible.
Whilst just being able to reference csv files in data-diff might be another option, doing this via DuckDB would allow you to perform some basic transformations on the way, such as renaming fields, selecting a reduced range, etc.
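To make the "transformations on the way" idea concrete, here is a rough sketch of loading a file into a DuckDB table while renaming columns and filtering rows. It assumes the `duckdb` Python package; the helpers `build_load_sql` and `load_files`, the table names, and the column names in the usage note are all hypothetical. It only covers the DuckDB side and makes no assumption about how data-diff would connect to the result.

```python
from typing import Dict, List, Optional


def build_load_sql(table: str, csv_path: str,
                   renames: Dict[str, str],
                   where: Optional[str] = None) -> str:
    """Build a CREATE TABLE ... AS SELECT that renames columns and
    optionally filters rows while reading a CSV file (hypothetical helper)."""
    # Double any single quotes so the path stays a valid SQL string literal.
    safe_path = csv_path.replace("'", "''")
    # Keys are the column names in the file, values are the new names.
    columns = ", ".join(f'"{old}" as "{new}"' for old, new in renames.items())
    sql = f'create or replace table "{table}" as select {columns} from \'{safe_path}\''
    if where:
        # The filter is written against the column names in the file.
        sql += f" where {where}"
    return sql


def load_files(db_path: str, statements: List[str]) -> None:
    """Run the statements against a DuckDB database file."""
    import duckdb  # third-party dependency, only needed to execute the SQL

    con = duckdb.connect(db_path)  # creates the file if it does not exist
    try:
        for statement in statements:
            con.execute(statement)
    finally:
        con.close()
```

For a PROD v UAT comparison you would build one statement per file, for example `build_load_sql("prod", "/data/prod.csv", {"PostingDate": "posting_date"})` and the same for `"uat"`, then pass both to `load_files` so the two tables end up in one database with matching column names.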
I actually started working on a DuckDB driver not so long ago and might have something ready next week, but the second part of this ("Whilst just being able to reference csv files in data-diff might be another option, doing this via duckDB would allow you to perform some basic transformations on the way; such as renaming fields, selecting a reduced range etc") might deserve a separate issue, as it could be generalized for all drivers, no?
"I have actually started working on a duckdb driver not so long ago, might have something ready next week"
Nice! Would love to give it a go when you have something (though I should point out I'm a data-diff newbie, so not across every aspect of it).
"might deserve a separate issue as it could be generalized for all drivers, no?"
I absolutely agree... though I think an aspect of this is captured in https://github.com/datafold/data-diff/issues/79
DuckDB is now supported! It's already available in master, and will be included in the upcoming release.
Awesome! Look forward to trying it out!