
Monorepo support

Open niderhoff opened this issue 2 years ago • 24 comments

Hello,

I appreciate the introduction of Hatch and what it offers. But of course we are all looking for a build tool that can do everything. So I ask, since it is very common with microservice-based architectures: are monorepos somehow a concern in the design of Hatch?

I mean instead of a structure like this:

package
├── module
├── tests
└── pyproject.toml

having something like this:

package_a
├── module
├── tests
└── pyproject.toml
package_b
├── module
├── tests
└── pyproject.toml

with the requirement of

  • also having shared project dependencies, e.g. pytest and similar, which are shared among all sub-projects and are automatically installed into each project's dev dependencies
  • quickly switching projects (i.e. virtual environments)?
  • executing tests automatically in each separate virtual environment?

niderhoff avatar May 11 '22 10:05 niderhoff

Hello!

shared project dependencies

You can add a custom metadata hook that would modify dependencies and store it at the root for use by all packages.
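For example, a minimal sketch of such a hook (the shared-requirements.txt file at the repository root is an assumption for illustration): each package lists dependencies as dynamic and enables the custom hook in its pyproject.toml,

[project]
dynamic = ["dependencies"]

[tool.hatch.metadata.hooks.custom]
# the hook file defaults to hatch_build.py in the package directory

and the hook itself reads the shared file:

# hatch_build.py (one per package; reads a list kept at the repository root)
from pathlib import Path

from hatchling.metadata.plugin.interface import MetadataHookInterface


class SharedDependenciesHook(MetadataHookInterface):
    """Set this package's dependencies from a file shared at the repo root."""

    def update(self, metadata):
        # self.root is the directory containing this package's pyproject.toml
        shared = Path(self.root).parent / "shared-requirements.txt"
        reqs = (line.strip() for line in shared.read_text().splitlines())
        metadata["dependencies"] = [r for r in reqs if r and not r.startswith("#")]

Per-package requirements could be merged into the same list inside update().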

quickly switching projects (i.e. virtual environments)? executing tests automatically in each separate virtual environment?

You can switch to project mode (or aware mode) and add the paths to your monorepos to dirs.project:

hatch config set dirs.project "['/path/to/monorepo']"

Hatch always has project awareness, so you can then use the --project / -p CLI flag to select projects:

hatch -p package_a ...

ofek avatar May 11 '22 14:05 ofek

Hey, thanks for your quick response here. It seems I fail to grasp the project concept fully, but what I learned now is:

I can create multiple packages using hatch. Let's say I am in a root folder /path/to/monorepo.

hatch new project1
hatch new project2

this gives me the following folder structure:

monorepo/
├─ project1/
│  ├─ project1/
│  ├─ tests/
│  ├─ pyproject.toml
├─ project2/
│  ├─ project2/
│  ├─ tests/
│  ├─ pyproject.toml

then I configure the monorepo, like so:

hatch config set dirs.project "['/path/to/monorepo']"

afterwards I can run commands from the root-folder like so:

hatch -p project1 run cov

So far so good. But the requirement I am explicitly looking for (and I don't really know how to achieve yet) is the following:

Sitting in the root folder as cwd (/path/to/monorepo) I want to execute commands like tests or linting in all subdirectories at once. Basically if I execute pytest . in the root folder, it will do that. However, since there is no root venv, there is no pytest. So I proceeded to create a 'root venv' using hatch new --init at the root directory.

But now when running pytest . the tests will fail with ImportErrors, because the packages project1/project1 and project2/project2 aren't installed into the root venv.
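One possible workaround here (a sketch, not an official Hatch feature): install each sub-project into the root environment in editable mode so that a root-level pytest can import them, e.g.

pip install -e ./project1 -e ./project2

Alternatively, tests can stay in per-project environments and be driven from the root with the -p flag, e.g. for d in project1 project2; do hatch -p "$d" run cov; done.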

Just a background: This requirement comes from IDE usage where when you open the monorepo as a root folder, it will fail to discover/execute tests in the sub-projects, unless everything is in 1 virtual environment.

niderhoff avatar May 13 '22 10:05 niderhoff

I have a similar use case to the one described here. I think it's worth describing, as a potential use case to reason about, even though the situation is less-than-ideal for a few reasons.

There is this project which has a source tree with multiple packages (in this specific case the packages are rooted in a single namespace package, but it might be irrelevant). Something like this:

repo_root
└── namespace
    ├── bar
    │   ├── __init__.py
    │   ├── aaa
    │   │   └── __init__.py
    │   └── bbb
    │       └── __init__.py
    └── foo
        ├── __init__.py
        └── something_foo.py


This is how it currently works (with custom scripts):

  1. the code can be imported, executed, and tested from repo_root
  2. each package could be a root to a project (in Hatch parlance), that is, it defines metadata, dependencies, and it could include sub-packages; and it's the unit of distribution
    • metadata is not shown in the tree, but you can imagine a toml file located in the package directory
    • projects can be in nested paths, e.g. you can have a/b as one project and a/b/c as a different one
  3. projects are discovered from the tree, rather than defined in a single file; by default, the name is constructed from the import path from the root
  4. projects can depend on other projects within the same tree
  5. in some cases the user might need to share a virtual env for more than one project (assuming it can be constructed without conflicts)

We want to adopt PEP 517, and Hatch is interesting because of its extensibility. But in its current shape I don't think that Hatch can be extended to discover projects in the tree. I'm also not sure that I can convince Hatch(ling) that:

  • pyproject.toml lives in the package directory itself
  • dependencies to other projects in the same tree are handled specifically, e.g.
    • if preparing a local virtualenv I want to ignore them, because I will import from the tree[^1]
    • if building a wheel/sdist they are copied as-is
    • (building an editable package is not really specified[^2])

Overall this pattern is not what I would recommend for a new project, but it still supports some real-world workflows. I'm not sure it would make sense for Hatch to support it out-of-the-box. Still, is it a legitimate use case for the plugin interface?

[^1]: this is a huge red flag over the whole pattern, because it explicitly ignores version specifiers for internal dependencies; respecting them might be possible but complicated
[^2]: internal dependencies from editable packages might be cascaded as further editable packages (is that even possible?)
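For what it's worth, the surface area of a Hatchling plugin is small. A metadata-hook plugin, for instance, is registered roughly like this (a sketch of the registration boilerplate only; the plugin name and the tree-discovery logic are placeholders, not an existing plugin):

from hatchling.metadata.plugin.interface import MetadataHookInterface
from hatchling.plugin import hookimpl


class TreeAwareMetadataHook(MetadataHookInterface):
    PLUGIN_NAME = "tree-aware"  # hypothetical plugin name

    def update(self, metadata):
        # here one could inspect the surrounding tree and rewrite
        # metadata["dependencies"] for internal projects
        ...


@hookimpl
def hatch_register_metadata_hook():
    return TreeAwareMetadataHook

Whether discovering projects from the tree fits the existing hook points is the open question above.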

sorcio avatar Jun 01 '22 13:06 sorcio

Hi, I was looking for a way to manage a monorepo with multiple AWS Lambdas and how to combine that with Python best practices. My requirements were:

  • usage of src/ folder
  • only one pyproject for the whole repo
  • ability to build Lambdas with separate requirements for each

Thanks to the great modularity of Hatch, I was able to come up with a custom builder called hatch-aws. You can use it to build AWS Lambdas. I plan to add a publish option in the future.

Hope somebody else finds it useful too.

aka-raccoon avatar Aug 22 '22 21:08 aka-raccoon

I also am interested in this. More precisely, I'm looking for something like Cargo workspaces that let me:

  1. Have a single top level lockfile.
  2. Establish interdependencies between libraries/apps and let me install/build a single app/library (e.g. install a single micro service and all of its dependencies in a container image).
  3. Can run tests on all packages (cargo test) or a single package (cargo test -p add_one). This is more of a nice to have since (ideally) you can always pushd add_one; pytest; popd.

I realize this may technically be possible with Hatch (based on https://github.com/pypa/hatch/issues/233#issuecomment-1123820713) but it seems like custom stuff and a good chunk of knowledge is required to make it work. It would be really nice to see a self contained writeup or plugin.
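For reference (this is Cargo, not a Hatch feature; shown only to illustrate the model being pointed to), a Cargo workspace is driven by a single top-level manifest with one shared Cargo.lock at the root:

# Cargo.toml at the repository root
[workspace]
members = ["add_one", "app"]

# app/Cargo.toml can then depend on the sibling crate by path
[dependencies]
add_one = { path = "../add_one" }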

adriangb avatar Sep 08 '22 16:09 adriangb

Definite +1, it would definitely be nice to be able to do this in Python.

As well as cargo workspaces, other points of reference are https://docs.npmjs.com/cli/v7/using-npm/workspaces / https://classic.yarnpkg.com/lang/en/docs/workspaces/ (plus other tools in JS like lerna and turbo), which I imagine Rust took inspiration from

chrisjsewell avatar Nov 28 '22 21:11 chrisjsewell

We are also looking for something similar in Airflow.

I've learned about Hatch and future monorepo support from the Talk Python To Me podcast https://talkpython.fm/episodes/show/408/hatch-a-modern-python-workflow. In Airflow we have a monorepo with effectively ~90 packages, and we would likely be glad to help in developing this and definitely be the "test drive" for it.

For many years we have run our own custom version of what is discussed here:

  • we release up to 90 packages (airflow + providers) from a single monorepo (regularly 10-20 packages twice a month, but sometimes all 90 of them).

  • we have support for an automatically updated set of constraints that provides a single "consistent" set of dependencies (including transitive ones); it represents not only a "development" snapshot of those, but also provides a "golden" set of constraints for our users to install the latest airflow + latest providers.

  • we have the breeze development environment that uses docker-compose and complex tooling that allows developers to synchronize to the latest "blessed" (i.e. passing all our unit tests) dependency versions

  • we have complex CI tooling that uses pip's eager-upgrade feature to automatically resolve our ~700 (!) transitive dependencies to the latest "consistent" set. We sometimes have 10-20 packages updated there, and we keep the full history of those updates from our automated CI - the history of those constraint updates is kept in our GitHub repo and used from there: https://github.com/apache/airflow/commits/constraints-main

  • and quite a few other things - for example, all the "regular" PRs automatically use the latest "blessed" development constraints to isolate contributors (we merge 80-100 PRs a week) from those automated dependency-upgrade side effects

We are eyeing https://github.com/apache/airflow/issues/33909 as a way to organize our repo a bit better: rather than generating those packages dynamically on the fly, we want to restructure our repo to actually have all those packages as individual "standalone" packages in the repo - not just sub-packages of the main "airflow" package that we then split out using our own tooling and scripts following our own conventions. That got us through the last 3 years while standard Python packaging was not yet capable of handling our case, but seeing what Hatch is aiming at, I am personally excited that maybe we will be able to get rid of some of the ugly parts of our tooling (which I mostly wrote personally and iterated over through the years).

Can we help somehow? Is there some concerted effort to make this happen? We would love not only to "use" it but also maybe help in developing it (we have quite a number of talented engineers as maintainers and I am sure we can help).

We have the Airflow Summit coming in September and some "resting" planned after that - and some of the "cleanup" to be done in Airflow to make it possible - but maybe there is a chance we could form a task force to make it happen afterwards (I personally will be much more "free" to help starting the 2nd half of October)?

potiuk avatar Aug 30 '23 10:08 potiuk

The initial examples in this ticket seem to say that they're ok with hatch creating a separate virtualenv for each project. I'm interested in a workspace with all projects installed into the same virtualenv - this would be much cleaner for local development and is more similar to how a yarn/npm workspace works. Then I can pin the dependencies for the entire workspace, my IDE can use that virtualenv, pytest can import everything, etc.

mmerickel avatar Aug 30 '23 15:08 mmerickel

@mmerickel: But this would cause massive dependency bleeding, i.e. you won't be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.

LordFckHelmchen avatar Dec 05 '23 08:12 LordFckHelmchen

But this would cause massive dependency bleeding, i.e. you won't be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.

It's not bleeding to install the workspace dependencies into one virtualenv. It's necessary to let them work together, which is what you generally want in a workspace where you have a "project" created from a bunch of Python packages that you want to develop together. It does sometimes make it harder to remember to put the right dependencies on a package, but that's the nature of how Python dependencies work. It's not worth losing the ability to install all projects in the workspace into a single virtualenv for all other purposes.

At the very least I want the ability to define a virtualenv that includes "package-a", and if it depends on "package-b" which is defined in the workspace, Hatch should be able to find it and install it in editable mode instead of trying to find it on PyPI. That would enable me to define several virtualenvs with different combinations of packages from the workspace, which is nice.
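Part of that can already be approximated today (a sketch under assumptions, not the workspace feature being asked for; package names and relative paths are illustrative): Hatch's context formatting can point a dependency at a sibling directory so it is not resolved from PyPI, and an environment can re-install siblings in editable mode after creation.

# relevant fragments of package_a/pyproject.toml
[project]
dependencies = [
  "package-b @ {root:uri}/../package_b",  # resolve the sibling from the workspace, not PyPI
]

[tool.hatch.metadata]
allow-direct-references = true  # required for the direct reference above

[tool.hatch.envs.default]
post-install-commands = [
  # re-install the sibling in editable mode so local changes are picked up
  "python -m pip install --no-deps -e ../package_b",
]

Note that a package declaring direct references like this cannot be published to PyPI as-is, and it still does not give a single shared virtualenv across the whole workspace, which is the part that needs first-class support.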

mmerickel avatar Dec 07 '23 00:12 mmerickel

But this would cause massive dependency bleeding, ie. you'll not be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.

No, not really. Exactly what @mmerickel explained.

The whole idea is that you can easily come up with the "common" set of dependencies that is the result of merging them and running them together, while being able to have "editable" sources in all the installed packages.

Generally you will have to figure out what the "best" set of those dependencies is. In the case of Airflow from the example above (https://github.com/pypa/hatch/issues/233#issuecomment-1698943770) we just combine all dependencies from all the packages that we have (airflow + 80+ provider packages) and let pip figure out the best set of dependencies with its eager-upgrade strategy. We currently do it in a shared CI container image and we use a lot of caching to speed things up, so in our main build we automatically upgrade to the "new" set of dependencies from the "last known good" one.

Maybe a real-life example from Airflow would be helpful here:

To deal with the problem of having effectively 80+ packages we had to implement some awful hacks.

So far what we do is have a very complex setup.py file that generates a devel-all extra. This devel-all extra puts together all dependencies from all packages. Those dependencies for each of the sub-packages are declared in an absolutely NOT standard way in our own provider.yaml file. Each provider has a separate provider.yaml where they are defined. It could be converted to pyproject.toml, but the way we want to be able to install all the packages together makes pyproject.toml useless for anything else than keeping dependencies there.

In order to allow local development and a single pip install -e . installing airflow + all providers together, we implemented some terrible hacks, and our monorepo structure does not make it easy to use a standard pyproject.toml placed exclusively in the package it serves (or so I think).

For local development those provider packages are put in the "providers" namespace package, and in our source repo we keep them all together in the same "source" structure as airflow. Our monorepo currently looks like this:

setup.py      <--- super complex installation logic that gathers dependencies from provider.yaml files (via pre-commit generated .json file) 
airflow          <- airflow code here, this is a regular package with __init__.py
   |---- models         <- example regular airflow package __init__.py
   |---- providers       <- no __init__.py here, this is a namespace package
        | ---- amazon          <- __init__.py here, this is a regular package where amazon provider code is kept
               |---- provider.yaml        <-- here amazon dependencies are declared
               | --- hooks           <- __init__.py here
                           | --- amazon_hook.py      <- imported as "from airflow.providers.amazon.hook import amazon_hook"
        | ----google        <- __init__.py here this is a regular package where google provider code is kept
                |---- provider.yaml        <-- here google dependencies are declared

Here is how it works for our local development: when we want to develop any of the providers and airflow at the same time, we do:

INSTALL_PROVIDERS_FROM_SOURCES="true" pip install -e ".[amazon,google]"

So when we are installing it locally for development, we are effectively installing airflow + all provider sources + the dependencies that we specify via extras. We also had to implement the "INSTALL_PROVIDERS_FROM_SOURCES" env variable hack to prevent the main package from pulling some of the providers from PyPI rather than using them directly from sources.

This is all super-hacky and complex. For example, in order to build a provider package, we need to effectively copy the code of the provider to a new source tree, generate a pyproject.toml there for this provider, and build the package from there. We have it all automated and it has worked nicely for years, but I would love to convert all those providers to regular packages (even if we keep them in the monorepo).

We cannot do this (I believe):

pyproject.toml       <-- dependencies for airflow defined here
airflow        <- airflow code here, this is a regular package with __init__.py
   |---- models       <- example regular airflow package __init__.py
   |---- providers       <- no __init__.py here, this is a namespace package
        | ---- amazon        <- __init__.py here, this is a regular package where amazon provider code is kept
               | --- hooks        <- __init__.py here
                           | --- amazon_hook.py       <- imported as "from airflow.providers.amazon.hook import amazon_hook" 
               |---- pyproject.toml        <-- dependencies for amazon defined here
        | ----google       <- __init__.py here this is a regular package where google provider code is kept
              |---- pyproject.toml       <-- dependencies for google defined here

That would not work, because:

a) pyproject.toml is declarative and we cannot do dynamic calculations based on what is defined in a dependent pyproject.toml (probably we could actually generate pyproject.toml with pre-commit, so this is not a big issue)

b) more importantly, having pyproject.toml defined in a sub-package of the project is effectively not possible (and it would be super confusing). I cannot imagine having an "apache.airflow.providers.amazon" package defined via a pyproject.toml where the top-level code (relative to pyproject.toml) should be imported with "from apache.airflow.providers.amazon". I think a number of tools and installers would be quite confused by the fact that the "root" of PYTHONPATH is actually 3 levels above where pyproject.toml is defined.

But maybe I am wrong and this is entirely normal and supported?

If I am right, then I believe we need something like this:

pyproject.toml <-- dependencies for airflow defined here
airflow <- airflow code here, this is a regular package with __init__.py
     |---- models <- part of the regular airflow package __init__.py
providers
    | ---- amazon
    |         |  pyproject.toml     <-- dependencies for amazon defined here
    |         |  -----airflow      <- regular package with __init__.py (might be namespace actually)
    |         |            |-------- providers     <- regular package with __init__.py (might be namespace actually)
    |         |                             | ------ amazon      <- regular package with __init__.py
    | ---- google
    |         |  pyproject.toml       <-- dependencies for google defined here
    |         |  -----airflow        <- regular package with __init__.py (might be namespace actually)
    |         |            |-------- providers       <- regular package with __init__.py (might be namespace actually)
    |         |                              | ------ google      <- regular package with __init__.py

Then each project would have a completely separate subfolder and be a "regular" Python package that I could just install independently for editable work like this:

pip install -e providers/amazon

While to install airflow as "editable" I need to do this:

pip install -e .

And what I am looking for is a "standard" way to say: "install airflow + those providers + all of their dependencies, and I want the airflow and provider code to be editable".

Maybe:

pip install -e . --with-subprojects amazon google
pip install -e . --with-subprojects all

Where I end up with a virtualenv containing airflow + all chosen subfolders in editable mode + all dependencies of both airflow and all of the selected subprojects installed.

potiuk avatar Dec 10 '23 14:12 potiuk

All of this feedback is incredibly useful and will directly assist me in creating the workspaces feature!

ofek avatar Dec 10 '23 16:12 ofek

All of this feedback is incredibly useful and will directly assist me in creating the workspaces feature!

Happy to help if you are open to it :).

potiuk avatar Dec 10 '23 17:12 potiuk

👋 I am new to Hatch and am learning about the features and possibilities of the tool. Since this is a conversation about monorepos, I hope it is okay to share some of the work I am doing with a monorepo-specific architecture called Polylith here.

Previously, Polylith was a Poetry-only feature (built as a Poetry plugin). Yesterday I released a new tool called polylith-cli that makes it possible to use Polylith with Python and Hatch. It is an early release with some parts missing, and I also have things to add to the docs 😄

In short, it is about sharing code between projects (the artifacts to build and deploy) in a really simple way and with a "single project/single repo" developer experience.

Just now I recorded a quick intro to the tool, with a live demo of the Polylith monorepo support the tooling adds, using Hatch features that I have learned: https://youtu.be/K__3Uah3by0

DavidVujic avatar Jan 16 '24 16:01 DavidVujic

@DavidVujic -> It does look promising. It's a bit difficult to wrap your head around bases/components (especially since "base" is not really a term I've seen used outside of the Polylith architecture, I think), but yeah - that seems to be worth looking at (I think I will - in the coming months).

potiuk avatar Jan 16 '24 17:01 potiuk

@potiuk I understand that it is a new term! It borrows ideas from LEGO: a base is just like a LEGO base plate, where you can add building blocks/bricks (components in Polylith) to build something useful 😄

DavidVujic avatar Jan 16 '24 19:01 DavidVujic

Any plan for when this workspace feature will be implemented?

chuckliu1979 avatar Jan 17 '24 09:01 chuckliu1979

I have created a sandbox where I have tried out a monorepo approach to publishing multiple packages from one git repo: see hatch-monorepo-sandbox. So far it has worked for my use case.

manuel-koch avatar Mar 02 '24 15:03 manuel-koch