
Separate model from data-contract-cli repository into a data-contract-model repo

Open julestruong opened this issue 10 months ago • 4 comments

Hello @jochenchrist @simonharrer

At BackMarket, we have built a tool that helps teams generate data contracts for their respective models. For now, we basically copy-paste the datacontract/model/data_contract_specification.py file into our own project.

We would like to use the data contract models that live in the datacontract-cli project, which is why we contributed to this PR, for instance. The problem is the project's dependencies: since the repository holds the whole CLI rather than only the models, it pulls in a lot of packages. Those dependencies clash with our own, so in practice we can't use the library 😞

I would like to move the model files into a smaller "data-contract-models" repository that could benefit the whole Python ecosystem!

From my point of view, this would only mean moving the datacontract/model/data_contract_specification.py file into another repository. Feel free to ask more questions so we can see how we could help and enable the use of the data contract CLI in our stack.

What do you think?

Regards

julestruong, Feb 19 '25 10:02

Hi julestruong,

Thanks for the suggestion. I’m a bit cautious about the complexity, but I do understand the issues with dependencies. We introduced extras to keep the dependencies for the core minimal. Which specific dependency is causing problems?

If we want to explore your suggestion further, it would make sense to create a Python package for the Data Contract Specification that simply publishes the specification as a Pydantic model for each version. However, I’m not entirely sure how this would work if we need to support multiple versions simultaneously in the CLI. Additionally, we would need to do the same for the ODCS specification.
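One way the multi-version question could be approached is to keep one model class per major version and pick the class from the version field of the contract. The following is only a rough sketch under several assumptions: standard-library dataclasses stand in for Pydantic models so that it runs without extra dependencies, the class names, the `parse_contract` helper and the `newer_only_section` field are invented for illustration, and the version is assumed to sit in a top-level `dataContractSpecification` field.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Type


@dataclass
class SpecMajor0:
    """Hypothetical, heavily reduced model for 0.x contracts."""
    id: str
    dataContractSpecification: str
    models: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SpecMajor1:
    """Hypothetical model for 1.x contracts; `newer_only_section` is invented."""
    id: str
    dataContractSpecification: str
    models: Dict[str, Any] = field(default_factory=dict)
    newer_only_section: Dict[str, Any] = field(default_factory=dict)


# Registry mapping a major version to the model class that understands it.
MODELS_BY_MAJOR: Dict[str, Type[Any]] = {"0": SpecMajor0, "1": SpecMajor1}


def parse_contract(raw: Dict[str, Any]) -> Any:
    """Build the model matching the contract's declared specification version."""
    version = raw.get("dataContractSpecification")
    if version is None:
        raise ValueError("missing 'dataContractSpecification' version field")
    major = str(version).split(".")[0]
    model = MODELS_BY_MAJOR.get(major)
    if model is None:
        raise ValueError(f"unsupported specification version: {version}")
    # Keep only the keys this version's model declares; unknown keys are dropped.
    known = {f.name for f in fields(model)}
    return model(**{key: value for key, value in raw.items() if key in known})
```

With something like this, the CLI would call `parse_contract(...)` and branch on the returned type, while model-only consumers could import a single version class directly. Whether unknown keys should be dropped or rejected is a design choice this sketch does not settle.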

jochenchrist, Feb 19 '25 14:02

Options:

  • Option A: When we release, we release both the datacontract-cli and the datacontract-models as separate pip modules, but with the same version. We could use a release script that defines the release process for datacontract-models, without having to make this a really separate pip module that is used by datacontract-cli.
  • Option B: Monorepo with two subfolders, one for each of the pip modules.
  • Option C: Different git repositories for each pip module
  • Option D: Solve the dependency hell so that this is not necessary anymore

simonharrer, Feb 25 '25 09:02

my2cts: In most organizations, either a monorepo-with-submodules approach (Option B) or separate repos (Option C) tends to be the cleanest way to ensure that the CLI does not drag in dependencies for use cases that only need the model definitions.
So I would restrict the choice to B or C. Which of the two often comes down to governance and release cadence:

  • If you want separate lifecycles and minimal cross-talk, choose separate repos: Option C.
  • If you still want a single home for issues and PRs, but separate packages so that the CLI dependencies are not forced on model-only consumers, choose a monorepo: Option B, which is easier to manage for CI pipelines.

Moving “data_contract_specification.py” into its own installable Python package with minimal dependencies is the key.
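Whichever layout is chosen, "minimal dependencies" can be checked mechanically. Below is a small sketch, assuming a hypothetical helper named `unwanted_imports`, that imports a module in a fresh interpreter and reports which forbidden top-level packages ended up loaded. The usage at the end relies on standard-library modules as stand-ins; in a real check one would pass the model package and the heavy dependencies to keep out.

```python
import json
import subprocess
import sys
from typing import Iterable, Set


def modules_loaded_by(module_name: str) -> Set[str]:
    """Import `module_name` in a fresh interpreter and return the
    top-level names of every module loaded in that process."""
    code = (
        "import importlib, json, sys; "
        f"importlib.import_module({module_name!r}); "
        "print(json.dumps(sorted({n.split('.')[0] for n in sys.modules})))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # The imported module may print on its own; the JSON list is the last line.
    return set(json.loads(result.stdout.strip().splitlines()[-1]))


def unwanted_imports(module_name: str, forbidden: Iterable[str]) -> Set[str]:
    """Return the forbidden top-level packages that importing the module pulls in."""
    loaded = modules_loaded_by(module_name)
    return {name for name in forbidden if name in loaded}


# Stand-in usage with standard-library modules only:
# importing `string` does not load `csv`, importing `csv` obviously does.
print(unwanted_imports("string", ["csv"]))
print(unwanted_imports("csv", ["csv"]))
```

Running such a check in CI for the models package would flag a regression as soon as a model file starts importing something from the CLI side. Note that the child process itself imports `json` and `importlib`, so those two names cannot be meaningfully listed as forbidden.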

fvaleye, Mar 05 '25 09:03

We chose separate git repos.

We started working on this. Feel free to have a look at https://github.com/datacontract/datacontract-specification-python and give feedback.

simonharrer, Mar 28 '25 16:03