Monorepo support
Hello,
I appreciate the introduction of Hatch and what it offers. But of course we are all looking for a build tool that can do everything. So I ask: since monorepos are very common with microservice-based architectures, is their usage somehow a consideration in the design of Hatch?
I mean instead of a structure like this:
package
├── module
├── tests
└── pyproject.toml
having something like this:
package_a
├── module
├── tests
└── pyproject.toml
package_b
├── module
├── tests
└── pyproject.toml
with the requirements of:
- also having shared project dependencies, i.e. pytest and similar, which are shared among all sub-projects and are automatically installed into each project's dev dependencies
- quickly switching projects (i.e. virtual environments)
- executing tests automatically in each separate virtual environment
Hello!
shared project dependencies
You can add a custom metadata hook that would modify dependencies and store it at the root for use by all packages.
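For example, something like this in each package's pyproject.toml could work (a minimal sketch; the hook class itself would live in hatch_build.py, subclass MetadataHookInterface, and read a shared requirements file kept at the monorepo root in its update() method - the package name and file name below are just placeholders):

```toml
[project]
name = "package-a"            # hypothetical sub-project
version = "0.0.1"
dynamic = ["dependencies"]    # dependencies are provided by the metadata hook

# Enable the custom metadata hook. Its update() implementation can load e.g.
# ../shared-dependencies.txt from the monorepo root and set
# metadata["dependencies"] for every package that opts in.
[tool.hatch.metadata.hooks.custom]
path = "hatch_build.py"
```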
quickly switching projects (i.e. virtual environments)? executing tests automatically in each separate virtual environment?
You can switch to project mode (or aware mode) and add the paths to your monorepos to dirs.project:
hatch config set dirs.project "['/path/to/monorepo']"
Hatch always has project awareness, so you can then use the --project / -p CLI flag to select projects:
hatch -p package_a ...
Hey, thanks for your quick response here. It seems I fail to grasp the project-concept fully, but what I learned now is:
I can create multiple packages using hatch. Let's say I am in a root folder /path/to/monorepo.
hatch create project1
hatch create project2
this gives me the following folder structure:
monorepo/
├─ project1/
│ ├─ project1/
│ ├─ tests/
│ ├─ pyproject.toml
├─ project2/
│ ├─ project2/
│ ├─ tests/
│ ├─ pyproject.toml
then I configure the monorepo, like so:
hatch config set dirs.project "['/path/to/monorepo']"
afterwards I can run commands from the root-folder like so:
hatch -p project1 run cov
So far so good. But the requirement I am explicitly looking for (and I don't really know how to achieve it yet) is the following:
Sitting in the root folder as cwd (/path/to/monorepo) I want to execute commands like tests or linting in all subdirectories at once. Basically, if I execute pytest . in the root folder, it will do that. However, since there is no root venv, there is no pytest. So I proceeded to create a "root venv" using hatch new --init in the root directory.
But now when running pytest . the tests will fail with ImportErrors, because the packages project1/project1 and project2/project2 aren't installed into the root venv.
Just for background: this requirement comes from IDE usage, where if you open the monorepo as the root folder, the IDE will fail to discover/execute tests in the sub-projects unless everything is in one virtual environment.
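A workaround I am considering, though I have not verified it: give the root pyproject.toml a Hatch environment that installs both sub-projects as local references, so pytest and the IDE can import them from a single venv. This assumes Hatch's context formatting ({root:uri}) and direct references are accepted in environment dependencies; the installs are plain, not editable, so they would need reinstalling after changes.

```toml
[tool.hatch.envs.default]
skip-install = true   # the root folder itself is not a real package to install
dependencies = [
  "pytest",
  # local sub-projects, resolved relative to this root pyproject.toml
  "project1 @ {root:uri}/project1",
  "project2 @ {root:uri}/project2",
]

[tool.hatch.envs.default.scripts]
test = "pytest {args:.}"
```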
I have a similar use case to the one described here. I think it's worth describing, as a potential use case to reason about, even though the situation is less-than-ideal for a few reasons.
There is this project which has a source tree with multiple packages (in this specific case the packages are rooted in a single namespace package, but it might be irrelevant). Something like this:
repo_root
└── namespace
├── bar
│ ├── __init__.py
│ ├── aaa
│ │ └── __init__.py
│ └── bbb
│ └── __init__.py
└── foo
├── __init__.py
└── something_foo.py
That's how this works currently (with custom scripts):
- the code can be imported, executed, and tested from repo_root
- each package could be the root of a project (in Hatch parlance), that is, it defines metadata and dependencies, it could include sub-packages, and it is the unit of distribution
- metadata is not shown in the tree, but you can imagine a toml file located in the package directory
- projects can be in nested paths, e.g. you can have a/b as one project and a/b/c as a different one
- projects are discovered from the tree, rather than defined in a single file; by default, the name is constructed from the import path from the root
- projects can depend on other projects within the same tree
- in some cases the user might need to share a virtual env for more than one project (assuming it can be constructed without conflicts)
We want to adopt PEP 517, and Hatch is interesting because of its extensibility. But in its current shape I don't think that Hatch can be extended to discover projects in the tree. I'm also not sure that I can convince Hatch(ling) that:
- pyproject.toml lives in the package directory itself
- dependencies on other projects in the same tree are handled specially, e.g.
  - if preparing a local virtualenv I want to ignore them, because I will import from the tree[^1]
  - if building a wheel/sdist they are copied as-is
  - (building an editable package is not really specified[^2])
Overall this pattern is not what I would recommend for a new project, but it still supports some real-world workflows. I'm not sure it would make sense for Hatch to support it out-of-the-box. Still, is it a legitimate use case for the plugin interface?
[^1]: this is a huge red flag over the whole pattern, because it explicitly ignores version specifiers for internal dependencies; respecting them might be possible but complicated
[^2]: internal dependencies from editable packages might be cascaded as further editable packages (is that even possible?)
Hi, I was looking for a way to manage a monorepo with multiple AWS Lambdas and to combine that with Python best practices. My requirements were:
- usage of a src/ folder
- only one pyproject.toml for the whole repo
- ability to build Lambdas with separate requirements for each
Thanks to the great modularity of Hatch I was able to come up with a custom builder called hatch-aws. You can use it to build AWS Lambdas. I plan to add a publish option in the future.
Hope somebody else finds it useful too.
I am also interested in this. More precisely, I'm looking for something like Cargo workspaces that let me:
- Have a single top-level lockfile.
- Establish interdependencies between libraries/apps and let me install/build a single app/library (e.g. install a single microservice and all of its dependencies in a container image).
- Run tests on all packages (cargo test) or a single package (cargo test -p add_one). This is more of a nice-to-have since (ideally) you can always pushd add_one; pytest; popd.
I realize this may technically be possible with Hatch (based on https://github.com/pypa/hatch/issues/233#issuecomment-1123820713), but it seems like custom setup and a good chunk of knowledge is required to make it work. It would be really nice to see a self-contained write-up or plugin.
Definite +1, it would be very nice to be able to do this in Python.
As well as cargo workspaces, other points of reference are https://docs.npmjs.com/cli/v7/using-npm/workspaces / https://classic.yarnpkg.com/lang/en/docs/workspaces/ (plus other tools in JS like lerna and turbo), which I imagine Rust took inspiration from
We are also looking for something similar in Airflow.
I learned about Hatch and future monorepo support from the Talk Python To Me podcast https://talkpython.fm/episodes/show/408/hatch-a-modern-python-workflow. In Airflow we have a monorepo with effectively ~90 packages, and we would likely be glad to help develop this feature and would definitely be the "test drive" for it.
For many years we have run our own custom version of something that is discussed here:
- we release up to 90 packages (airflow + providers) from the single monorepo (regularly 10-20 packages twice a month, but sometimes all 90 of them)
- we have support for an automatically updated set of constraints that provides a single "consistent" set of dependencies (including transitive ones); these represent not only a "development" snapshot but also a "golden" set of constraints for our users to install the latest airflow + latest providers
- we have a breeze development environment that uses docker-compose and complex tooling that allows developers to synchronize to the latest "blessed" (i.e. passing all our unit tests) dependency versions
- we have complex CI tooling that uses pip's eager-upgrade feature to automatically resolve our ~700 (!) transitive dependencies to the latest "consistent" set. We sometimes have 10-20 packages updated there, and we keep the full history of those constraint updates from our automated CI in our GitHub repo and use it from there: https://github.com/apache/airflow/commits/constraints-main
- and quite a few other things - such as all the "regular" PRs automatically using the latest "blessed" development constraints to isolate contributors (we merge 80-100 PRs a week) from the side effects of those automated dependency upgrades
We are eyeing https://github.com/apache/airflow/issues/33909 as a way to organize our repo a bit better: rather than generating those packages dynamically on the fly, we want to restructure our repo to actually have all those packages as individual "standalone" packages in the repo - and not just sub-packages of the main "airflow" package that we then split out using our own tooling and scripts following our own conventions. That got us through the last 3 years while standard Python packaging was not yet capable of handling our case, but seeing what Hatch is aiming at, I am personally excited that maybe we will be able to get rid of some of the ugly parts of our tooling (most of which I personally wrote and iterated on through the years).
Can we help somehow? Is there some concerted effort to make this happen? We would love not only to "use" it but also to help develop it (we have quite a number of talented engineers as maintainers and I am sure we can help).
We have the Airflow Summit coming in September and some "resting" planned after that - and some "cleanup" to be done in Airflow to make it possible - but maybe there is a chance we could form a task force to make it happen afterwards (I personally will be much more "free" to help starting the 2nd half of October).
The initial examples in this ticket seem to say that they're ok with hatch creating a separate virtualenv for each project. I'm interested in a workspace with all projects installed into the same virtualenv - this would be much cleaner for local development and is more similar to how a yarn/npm workspace works. Then I can pin the dependencies for the entire workspace, my IDE can use that virtualenv, pytest can import everything, etc.
@mmerickel: But this would cause massive dependency bleeding, ie. you'll not be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.
But this would cause massive dependency bleeding, ie. you'll not be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.
It's not bleeding to install the workspace dependencies into one virtualenv. It's necessary to let them work together, which is what you generally want in a workspace where you have a "project" created from a bunch of python packages that you want to develop together. It does make it problematic sometimes to remember to put the right dependencies on a package, but that's the nature of how python dependencies work. It's not worth losing the ability to install all projects in the workspace into a single virtualenv for all other purposes.
At the very least I want the ability to define a virtualenv that includes "package A" and if it depends on "package-b" which is defined in the workspace, hatch should be able to find it and install it in editable mode instead of trying to find it on PyPI. That would enable me to define several virtualenvs with different combinations of packages from the workspace which is nice.
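Today the closest approximation I can see is defining separate environments with hand-written local references, e.g. (hypothetical names; these are plain installs, not the editable, workspace-aware resolution I'm describing):

```toml
# Two environments combining different workspace packages via local direct
# references; {root:uri} resolves relative to this project's pyproject.toml.
[tool.hatch.envs.service-a]
dependencies = ["package-b @ {root:uri}/../package-b"]

[tool.hatch.envs.service-b]
dependencies = [
  "package-b @ {root:uri}/../package-b",
  "package-c @ {root:uri}/../package-c",
]
```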
But this would cause massive dependency bleeding, ie. you'll not be able to easily keep track of which dependencies are required by which subproject, right? In this case it's just a root-level pyproject.toml with a proper (pytest) configuration to let the tools know where to look.
No, not really. It's exactly what @mmerickel explained.
The whole idea is that you can easily come up with the "common" set of dependencies that results from merging them and running them together, while being able to have "editable" sources in all the installed packages.
Generally you will have to figure out what the "best" set of those dependencies is. In the case of Airflow from the example above (https://github.com/pypa/hatch/issues/233#issuecomment-1698943770), we just combine all dependencies from all the packages that we have (airflow + 80+ provider packages) and let pip figure out the best set of dependencies with eager upgrades. We currently do it in a shared CI container image and use a lot of caching to speed things up, so in our main build we automatically upgrade to the "new" set of dependencies from the "last known good" one.
Maybe a real-life example from Airflow would be helpful here:
To deal with the problem of having effectively 80+ packages we had to implement some awful hacks.
So far what we do is we have a very complex setup.py file that generates a devel-all extra. This devel-all extra puts together all dependencies from all packages. Those dependencies for each of the sub-packages are declared in an absolutely NOT standard way in our own provider.yaml file. Each provider has a separate provider.yaml where they are defined. It could be converted to pyproject.toml, but the way we want to be able to install all the packages together makes pyproject.toml useless for anything other than keeping the dependencies there.
In order to allow local development and a single pip install -e . installing airflow + all providers together, we implemented some terrible hacks, and our monorepo structure does not make it easy to use a standard pyproject.toml placed exclusively in the package it serves (or so I think).
For local development those provider packages are put in a "providers" namespace package, and in our source repo we keep them all together in the same "source" structure as airflow. Our monorepo currently looks like this:
setup.py   <- super complex installation logic that gathers dependencies from provider.yaml files (via a pre-commit generated .json file)
airflow    <- airflow code here, this is a regular package with __init__.py
|---- models     <- example regular airflow package with __init__.py
|---- providers  <- no __init__.py here, this is a namespace package
|     |---- amazon <- __init__.py here, this is a regular package where amazon provider code is kept
|     |     |---- provider.yaml <- here amazon dependencies are declared
|     |     |---- hooks <- __init__.py here
|     |     |     |---- amazon_hook.py <- imported as "from airflow.providers.amazon.hooks import amazon_hook"
|     |---- google <- __init__.py here, this is a regular package where google provider code is kept
|     |     |---- provider.yaml <- here google dependencies are declared
Here is how it works for our local development: when we want to develop any of the providers and airflow at the same time, we do:
INSTALL_PROVIDERS_FROM_SOURCES="true" pip install -e ".[amazon,google]"
So when we are installing it locally for development, we are effectively installing airflow + all provider sources + the dependencies that we specify via extras. We also had to implement the "INSTALL_PROVIDERS_FROM_SOURCES" env variable hack to avoid the main package pulling some of the providers from PyPI rather than using them directly from sources.
This is all super hacky and complex. For example, in order to build a provider package, we need to effectively copy the code of the provider to a new source tree, generate a pyproject.toml there for this provider, and build the package from there. We have it all automated and it has worked nicely for years, but I would love to convert all those providers to be regular packages (even if we keep them in the monorepo).
We cannot do that (I believe):
pyproject.toml <- dependencies for airflow defined here
airflow    <- airflow code here, this is a regular package with __init__.py
|---- models     <- example regular airflow package with __init__.py
|---- providers  <- no __init__.py here, this is a namespace package
|     |---- amazon <- __init__.py here, this is a regular package where amazon provider code is kept
|     |     |---- hooks <- __init__.py here
|     |     |     |---- amazon_hook.py <- imported as "from airflow.providers.amazon.hooks import amazon_hook"
|     |     |---- pyproject.toml <- dependencies for amazon defined here
|     |---- google <- __init__.py here, this is a regular package where google provider code is kept
|     |     |---- pyproject.toml <- dependencies for google defined here
That would not work, because:
a) pyproject.toml is declarative and we cannot do dynamic calculations of what is defined in a dependent pyproject.toml (probably we could actually generate pyproject.toml with pre-commit, so this is not a big issue)
b) more importantly, having pyproject.toml defined in a sub-package of the project is effectively not possible (and it would be super confusing). I cannot imagine having the "apache.airflow.providers.amazon" package defined via a pyproject.toml where the top-level code (relative to pyproject.toml) should be imported with "from apache.airflow.providers.amazon". I think a number of tools and installers would be quite confused by the fact that the "root" of PYTHONPATH is actually 3 levels above where pyproject.toml is defined.
But maybe I am wrong and this is entirely normal and supported?
If I am right, then I believe we need something like this:
pyproject.toml <- dependencies for airflow defined here
airflow    <- airflow code here, this is a regular package with __init__.py
|---- models     <- part of the regular airflow package with __init__.py
|---- providers
|     |---- amazon
|     |     |---- pyproject.toml <- dependencies for amazon defined here
|     |     |---- airflow <- regular package with __init__.py (might be namespace actually)
|     |     |     |---- providers <- regular package with __init__.py (might be namespace actually)
|     |     |     |     |---- amazon <- regular package with __init__.py
|     |---- google
|     |     |---- pyproject.toml <- dependencies for google defined here
|     |     |---- airflow <- regular package with __init__.py (might be namespace actually)
|     |     |     |---- providers <- regular package with __init__.py (might be namespace actually)
|     |     |     |     |---- google <- regular package with __init__.py
Then each project would have a completely separate subfolder and be a "regular" Python package that I could just install independently for editable work like this:
pip install -e providers/amazon
While to install airflow as "editable" I need to do this:
pip install -e .
And what I am looking for is a "standard" way to say: "install airflow + those providers + all of their dependencies, and I want the airflow and provider code to be editable".
Maybe:
pip install -e . --with-subprojects amazon google
pip install -e . --with-subprojects all
Where I end up with a virtualenv containing airflow + all chosen subfolders in editable mode + all dependencies of both airflow and all of the selected subprojects installed.
All of this feedback is incredibly useful and will directly assist me in creating the workspaces feature!
All of this feedback is incredibly useful and will directly assist me in creating the workspaces feature!
Happy to help if you are open to it :).
👋 I am new to Hatch and am learning about the features and possibilities with the tool. Since this is a conversation about Monorepos, I hope it is okay to share some of the work I am doing with a Monorepo-specific architecture called Polylith here.
Previously, Polylith has been a Poetry-only feature (built as a Poetry plugin). Yesterday I released a new tool called polylith-cli that makes it possible to use Polylith with Python and Hatch. It is an early release with some parts missing, and I also have things to add to the docs 😄
In short, it is about sharing code between projects (the artifacts to build and deploy) in a really simple way and with a "single project/single repo" developer experience.
Just now, I recorded a quick intro to the tool, with a live demo of the Polylith monorepo support the tooling adds, using Hatch features that I have learned: https://youtu.be/K__3Uah3by0
@DavidVujic -> It does look promising. It's a bit difficult to wrap your head around bases/components (especially since "base" is not really a term I've seen used outside of the Polylith architecture, I think), but yeah - that seems to be worth looking at (I think I will - in the coming months).
@potiuk I understand that it is a new term! It borrows ideas from LEGO: a base is just like a LEGO base plate, where you can add building blocks/bricks (components in Polylith) to build something useful 😄
Any plan for when this workspace feature will be implemented?
I have created a sandbox where I have tried out a monorepo approach to publish multiple packages from one git repo: see hatch-monorepo-sandbox. So far it has worked for my use case.