
datacontract import --format dbt

Open simonharrer opened this issue 1 year ago • 3 comments

Out of #103 came the idea of having an import of dbt models to a datacontract.yaml

datacontract import --format dbt models.yaml

simonharrer avatar Mar 21 '24 10:03 simonharrer

I already do something like this for import, creating a datacontract.yaml from a dbt project, but I was using the "schema" field instead of the "models" field, with a custom schema type. (Slightly off-topic, but "schema" was much more widely understood than "models" in our workshops. Just some feedback on its deprecation in the specification.)

However, our code is/was quite specific to the format of the dbt projects we allowed. To do it properly, one would want to parse and use the manifest.json file from a dbt project. It is the most straightforward way of working with dbt projects generically.

You would go into the dbt nodes in the manifest, and for every node with a resource_type of model, import the columns, the data_types if given, the descriptions if given, etc. The only difficulty is mapping the data_types to the ones supported in the datacontract spec, which is why a physical-model-specific schema might make more sense for the import. As a first step though, the model in models could just omit the data_type, or provide the dbt one if it matches.
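A rough sketch of that loop in Python, to make the idea concrete. It assumes the usual manifest.json layout (a top-level "nodes" mapping whose entries carry "resource_type", "name", "description" and "columns"). The function name models_from_manifest and the TYPE_MAP table are made up for illustration and are not part of datacontract-cli; the mapping is deliberately incomplete, and columns with an unmapped type simply get no type.

```python
import json

# Hypothetical and incomplete mapping from warehouse types to contract types.
TYPE_MAP = {
    "varchar": "string",
    "text": "string",
    "integer": "int",
    "bigint": "long",
    "boolean": "boolean",
    "timestamp": "timestamp",
}


def models_from_manifest(manifest: dict) -> dict:
    """Build a 'models' mapping from the model nodes of a parsed manifest."""
    models = {}
    for node in manifest.get("nodes", {}).values():
        # Skip tests, seeds, snapshots and anything else that is not a model.
        if node.get("resource_type") != "model":
            continue
        fields = {}
        for column_name, column in node.get("columns", {}).items():
            field = {}
            if column.get("description"):
                field["description"] = column["description"]
            # data_type is optional in dbt and may be missing or None.
            data_type = (column.get("data_type") or "").lower()
            if data_type in TYPE_MAP:
                field["type"] = TYPE_MAP[data_type]
            fields[column_name] = field
        model = {"fields": fields}
        if node.get("description"):
            model["description"] = node["description"]
        models[node["name"]] = model
    return models


def load_manifest(path: str) -> dict:
    """Read a manifest.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Something like models_from_manifest(load_manifest("target/manifest.json")) would then give the models section; the path is only an example.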

(Dagster-dbt parses the manifest as well, and its code is Apache-2.0 licensed, if you are looking for inspiration.) The import is something I can contribute, if the implementation sounds OK.


Much easier, of course, is to be pointed to a dbt schema.yaml file and use that for importing the models. Anything not defined in that YAML file would be missed. Then again, maybe that's OK.
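For comparison, the schema.yaml route could look roughly like this. The sketch works on an already parsed file (for example the result of PyYAML's yaml.safe_load) and assumes the usual layout with a top-level "models" list whose entries have "name", "description" and a "columns" list. The function name models_from_schema_yaml is hypothetical, and types are left out entirely here.

```python
def models_from_schema_yaml(schema: dict) -> dict:
    """Build a 'models' mapping from a parsed dbt schema file."""
    models = {}
    # "models" or "columns" may be absent or empty in a real file.
    for model in schema.get("models") or []:
        fields = {}
        for column in model.get("columns") or []:
            field = {}
            if column.get("description"):
                field["description"] = column["description"]
            fields[column["name"]] = field
        entry = {"fields": fields}
        if model.get("description"):
            entry["description"] = model["description"]
        models[model["name"]] = entry
    return models
```

Columns that exist in the warehouse but are not listed in the file never show up, which is the limitation mentioned above.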

emirkmo avatar Mar 22 '24 14:03 emirkmo

I think the latter is fine, as I presume most people with more than a few dbt models split them into one model per file; otherwise it gets unwieldy very quickly. Either that, or parse them all but allow an input to specify which models you want to include in the data contract, as you might only want two or three for a specific contract.
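The "parse them all, then pick" option could be as small as the sketch below. It assumes the models have already been collected into a dict keyed by model name; select_models is a made-up helper name, not an existing function in the CLI.

```python
from typing import Optional


def select_models(models: dict, wanted: Optional[list] = None) -> dict:
    """Return only the requested models, or all of them if none are named."""
    if not wanted:
        return dict(models)
    # Fail loudly on typos instead of silently producing a smaller contract.
    missing = [name for name in wanted if name not in models]
    if missing:
        raise KeyError("unknown dbt models: " + ", ".join(missing))
    return {name: models[name] for name in wanted}
```

With no names given, everything is kept, so importing the whole project stays the default.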

pixie79 avatar May 10 '24 07:05 pixie79

I think the latter is fine, as I presume most people with more than a few dbt models split them into one model per file; otherwise it gets unwieldy very quickly.

This does not match my experience with larger dbt projects. But one or several models can logically coexist and be part of a data contract, so it is fine anyway? (It’s reasonable to expect people not to mix models from different data products/contracts.)

emirkmo avatar May 13 '24 07:05 emirkmo

I'm looking into this right now

torbenkeller avatar Jun 07 '24 13:06 torbenkeller

Awesome! I assigned you the issue. :-)

simonharrer avatar Jun 07 '24 15:06 simonharrer

@torbenkeller any progress here?

jochenchrist avatar Jul 01 '24 10:07 jochenchrist

I've been working with dbt, maybe I can help.

teoria avatar Jul 02 '24 20:07 teoria

@jochenchrist I was working on other things the last few weeks, sorry. But I will continue on this.

@teoria Sounds good. If you want, we can pair program to get this ready.

torbenkeller avatar Jul 02 '24 21:07 torbenkeller

@teoria you can contact me on the datacontract slack server

torbenkeller avatar Jul 03 '24 10:07 torbenkeller

Nice!

teoria avatar Jul 03 '24 21:07 teoria

First version of the dbt manifest importer: https://github.com/datacontract/datacontract-cli/pull/317

Usage:

Contract with all dbt models:
$ datacontract import --format dbt --source /path/manifest_dbt.json

Contract with only 2 models:
$ datacontract import --format dbt --source /path/manifest_dbt.json --dbt-model orders --dbt-model customers

teoria avatar Jul 09 '24 15:07 teoria

Mark this as closed.

simonharrer avatar Jul 26 '24 07:07 simonharrer