Merge dask and distributed repos?
I frequently feel pain from having two distinct repositories, dask/dask and dask/distributed. Lately we've been working much more on changes that affect both repos, and synchronizing PRs across repos is painful and cumbersome. With the addition of dask-expr there is now a third repo, and there are occasionally changes that span all three (e.g. sending Expr classes to the scheduler without materializing client-side).
Additionally, documentation, maintenance, and release procedures all mean additional work per repo.
The versions are currently hard-pinned to each other anyhow, so we have essentially sacrificed almost all the flexibility of having multiple repos already and are pretty much only paying for the disadvantages.
I would like to propose merging the two (three) repos into a single one. We would still maintain multiple Python packages, so nothing would change for the end user other than having a single issue tracker to report issues to.
The problems I suspect we'll run into are:
- CI runtime for distributed is relatively high. We can remove some redundant tests once both are in the same repo but would still have a longer runtime if everything is tested. We'd likely still maintain separate GitHub workflows for testing that use appropriate paths/paths-ignore filters to somewhat decouple this.
- CI of distributed is flaky and has been for a long time. However, if tests only run selectively depending on which paths changed, this would not get worse by merging the repos.
- This would likely impact all existing PRs since we'd likely have to change the directory structure to support multiple pyproject.toml files
Are there problems I haven't thought about? Any other reasons why the two code bases should remain separate? I'm not very familiar with packaging. Is there anything in this realm that needs consideration?
cc @mrocklin @jacobtomlinson @quasiben @jrbourbeau @rjzamora @charlesbluca @hendrikmakait @phofl
I think merging dask-expr into dask is an easy win. My understanding was that this would always be the goal anyway.
Merging distributed in sounds super painful given the long git history, open issues, and PRs. Also, the distributed CI is very slow and flaky, so I would expect this to cause pain for dask/dask contributors. We would need to set up a lot more rules to only trigger certain workflows on certain file changes, which would increase CI complexity even further. It's less clear to me that this is a good idea.

> We would need to set up a lot more rules to only trigger certain workflows on certain file changes, which would increase CI complexity even further. It's less clear to me that this is a good idea.
As long as the two packages are still separate, this should be easy with two distinct workflow files that target the respective directories.
Sure, if the source was completely separate then you could do that, but what value do you get from bringing things together if they are still separate? I guess you don't need to make two-part PRs anymore and can change things in both packages in a single PR. I can definitely see the appeal of that.
When working on distributed I blame and bisect a lot to figure things out, so we would have to be careful not to lose the history. But bringing everything into one repo would definitely make bisecting easier.
Bringing the issues over from distributed would be more challenging, but maybe we can write a script to transfer them? And all PRs would have to be abandoned, but maybe that's not a terrible thing.
> Sure, if the source was completely separate then you could do that, but what value do you get from bringing things together if they are still separate? I guess you don't need to make two-part PRs anymore and can change things in both packages in a single PR. I can definitely see the appeal of that.

Yes, that's currently the primary motivation. Eventually I might also be interested in talking about nuking distributed as a dedicated package, but I'm not there yet, and I figure this is a small step that does not bar any future direction.
> When working on distributed I blame and bisect a lot to figure things out, so we would have to be careful not to lose the history. But bringing everything into one repo would definitely make bisecting easier.

Yes, I do that a lot, too. I would also only want to do this if we preserve the commit history of both repos. Off the top of my head, I don't know exactly how, but I've done something similar in the past, so it should be possible.
> Bringing the issues over from distributed would be more challenging, but maybe we can write a script to transfer them? And all PRs would have to be abandoned, but maybe that's not a terrible thing.

Transferring issues wouldn't be a problem, although I'm not sure it's a sensible thing to do. We have 1.3k open issues in distributed, and I bet only a fraction of them are still actionable rather than stale. (I'm also happy to introduce a stale bot first if that's a concern.)
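For what it's worth, a bulk transfer could be scripted with the GitHub CLI. A minimal sketch, assuming `gh` is installed and authenticated, and that everything open moves from dask/distributed to dask/dask:

```python
# Sketch: bulk-transfer open issues between repos using the GitHub CLI.
import json
import subprocess

SOURCE = "dask/distributed"
TARGET = "dask/dask"

# List all open issues in the source repo as JSON.
out = subprocess.run(
    ["gh", "issue", "list", "--repo", SOURCE, "--state", "open",
     "--limit", "2000", "--json", "number"],
    check=True, capture_output=True, text=True,
).stdout

# Transfer each issue; comments and cross-references move with it.
for item in json.loads(out):
    number = item["number"]
    subprocess.run(
        ["gh", "issue", "transfer", str(number), TARGET, "--repo", SOURCE],
        check=True,
    )
    print(f"transferred {SOURCE}#{number}")
```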
> And all PRs would have to be abandoned, but maybe that's not a terrible thing.

Indeed, and I don't think that'd be a terrible thing. I would assume that anybody with decent git knowledge could salvage a PR.
> When working on distributed I blame and bisect a lot to figure things out, so we would have to be careful not to lose the history. But bringing everything into one repo would definitely make bisecting easier.

> Yes, I do that a lot, too. I would also only want to do this if we preserve the commit history of both repos. Off the top of my head, I don't know exactly how, but I've done something similar in the past, so it should be possible.
Merging git repos is definitely possible, see this SO answer for example. I gave it a quick try and much of it was successful; notably, and not unexpectedly, there are conflicts with files that exist in both repos, such as GitHub files, CI, some docs, pyproject.toml, etc. Full list below.
Merge conflicts
.flake8
.git-blame-ignore-revs
.github/PULL_REQUEST_TEMPLATE.md
.github/workflows/conda.yml
.github/workflows/publish-test-results.yaml
.github/workflows/release-drafter.yml
.github/workflows/release-publish.yml
.github/workflows/test-report.yaml
.gitignore
.pre-commit-config.yaml
.readthedocs.yaml
CODEOWNERS
CONTRIBUTING.md
LICENSE.txt
MANIFEST.in
README.rst
codecov.yml
conftest.py
continuous_integration/environment-3.10.yaml
continuous_integration/environment-3.11.yaml
continuous_integration/environment-3.12.yaml
continuous_integration/gpuci/build.sh
docs/Makefile
docs/make.bat
docs/release-procedure.md
docs/source/api.rst
docs/source/changelog.rst
docs/source/conf.py
docs/source/develop.rst
docs/source/faq.rst
docs/source/index.rst
docs/source/install.rst
docs/source/prometheus.rst
pyproject.toml
setup.py
So it would probably take someone knowledgeable about both repos at least a few hours to go through the conflicts carefully to prevent breaking anything, plus renaming files into their own directories. With all this said, merging doesn't look impossible for Dask+Distributed, if all the other aspects (like open issues, PRs, etc.) are resolved in a satisfactory manner for everyone.
Yeah, a stale bot and then transferring the rest would be a good move.
+1 on this. Recently, we've had an uptick in PRs or issues that span two or even all three repos. Having everything bundled up in a single repo would facilitate these changes, and it sounds like there is a path forward that has little downside.
Earlier today we merged the dask-expr repo into dask/dask, see https://github.com/dask/dask/pull/11623
dask/dask now includes the entire commit history of dask-expr. There are still a couple of cleanup tasks to be done and we'll have to archive the dask-expr repo (and deal with the issues one way or the other) but the code migration is done.
For the record, the dask-expr merge was done by following this blog post https://gfscott.com/blog/merge-git-repos-and-keep-commit-history/ to preserve the git history (probably similar to the SO post that was already recommended above)
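For reference, the core of that procedure is only a handful of git operations. A rough sketch of the idea, driven from Python for illustration (remote name, URL, and branch are assumptions; the blog post adds steps for moving files into a subdirectory first):

```python
# Sketch: merge another repo into the current one while keeping its history.
import subprocess

def run(*args):
    """Run a git command, raising if it fails."""
    subprocess.run(args, check=True)

# Inside a clone of the target repo (e.g. dask/dask):
run("git", "remote", "add", "distributed", "https://github.com/dask/distributed.git")
run("git", "fetch", "distributed")
# Stitch the two unrelated histories together; conflicting files
# (CI config, pyproject.toml, docs, ...) then have to be resolved by hand.
run("git", "merge", "--allow-unrelated-histories", "distributed/main")
```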
Thinking about the distributed merge a little more thoroughly, I am wondering mostly about packaging. I'm not sure there is an elegant solution for creating two wheels/packages from one repo. There are ways of doing this assuming that we'd also move the dask/dask code to a different directory, but I haven't found a way to do it while preserving the directory structure as-is.
What I propose for a distributed merge (assuming everybody is on board) would be...
- Merge dask/distributed into dask/dask including full commit history
- The distributed code would sit in the directory dask/distributed and would therefore be packaged as part of the dask wheel / dask-core package
- The extra dependencies for distributed would adopt the additional dependencies of distributed
- The code in the dask/distributed subdirectory would verify that all necessary dependencies are installed and would raise otherwise (see the first sketch after this list)
- We would cut one more distributed wheel / distributed feedstock release that lifts the pinning to the dask version and redirects imports to the main package dask.distributed. The distributed package would therefore merely serve as a meta package that ships additional dependencies (and redirects an import; see the second sketch after this list). We may or may not deprecate the import redirect eventually, but I don't see that as a pressing concern right away (unless there are actual problems with it)
- Regarding the issue tracker, I suggest just moving all (open) issues in bulk. While I see the appeal of implementing the stale bot first, this is also additional work, and if we're doing a bulk transfer it shouldn't matter whether an issue is stale or in which repo it eventually gets closed. I still like the idea of using a stale bot, but it feels like a separate problem
- CI infrastructure would isolate the distributed test suite (it's an isolated path). If a PR is labelled with a special label, the entire test suite runs. This way, flakiness and runtime are isolated. Main always runs the entire test suite.
- Stale PRs are abandoned
- The dask/distributed repo is archived
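To make the dependency check in the third bullet concrete, here is a minimal sketch of what the top of a hypothetical dask/distributed/__init__.py could do (the dependency names are assumptions, not the actual list):

```python
# Hypothetical dask/distributed/__init__.py: fail early with a helpful
# message if the extra dependencies of the distributed code are missing.
import importlib.util

# Assumed subset of distributed's hard runtime dependencies.
_REQUIRED = ("tornado", "msgpack", "psutil")

_missing = [name for name in _REQUIRED if importlib.util.find_spec(name) is None]
if _missing:
    raise ImportError(
        "dask.distributed requires additional dependencies that are not "
        f"installed: {', '.join(_missing)}. "
        "Install them with: python -m pip install 'dask[distributed]'"
    )
```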
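Similarly, the import redirect in the meta package could be as small as a module-level __getattr__ (PEP 562). This is a sketch of the idea, not a finished design; deep imports such as import distributed.client would need additional shim modules:

```python
# Hypothetical distributed/__init__.py in the final meta-package release:
# the package ships the extra dependencies and merely redirects imports.
import dask.distributed as _impl

def __getattr__(name):
    # `distributed.Client` and `from distributed import Client` both
    # resolve to the new home in dask.distributed.
    return getattr(_impl, name)
```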
This sounds scary, but in principle I don't see why it should be a problem.
I think @fjetter @hendrikmakait @phofl are best placed to decide if this is worth the effort as you all feel the pain of making twinned PRs regularly.
I would be tempted to just migrate the last year of issues and close everything older, just to speed up the migration. But it's up to whoever is doing the migration.
A change was made last week to a utility method in https://github.com/dask/dask/pull/11757, which caused the distributed CI to break (https://github.com/dask/distributed/issues/9016). This is useful anecdotal evidence in favour of merging the two repos.
However, it's interesting that this PR would likely have resulted in the same problem even if they were in the same repo, because we would still be using path rules to run the dask and distributed CI independently. I wonder if there is some way we can trigger the distributed CI on PRs where a dask utility is modified. We could include dask/utils.py in the distributed CI paths, but this change was in dask/core.py, so maybe that would need to be included too. I wonder if there is a more targeted way to run the distributed CI when code in dask that is imported by distributed is modified.
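One way to approximate that targeted trigger: scan the distributed source for dask imports and generate the path list for the workflow filter. A rough sketch under an assumed post-merge layout (the mapping from packages to paths is simplified):

```python
# Sketch: collect the dask modules imported by the distributed code, to
# derive a paths filter for the distributed CI workflow.
import ast
import pathlib

DISTRIBUTED_SRC = pathlib.Path("dask/distributed")  # assumed post-merge layout

imported = set()
for path in DISTRIBUTED_SRC.rglob("*.py"):
    tree = ast.parse(path.read_text(), filename=str(path))
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module)

# Keep dask modules outside the distributed subpackage itself, and map
# module names to file paths (packages would really map to dask/<pkg>/**).
for module in sorted(imported):
    if module.startswith("dask") and not module.startswith("dask.distributed"):
        print(module.replace(".", "/") + ".py")
```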
> However, it's interesting that this PR would likely have resulted in the same problem even if they were in the same repo, because we would still be using path rules to run the dask and distributed CI independently.

The whole matrix, yes, but you could always run a single distributed CI job (and vice versa) to cover hard breakage.

> but you could always run a single distributed CI job (and vice versa) to cover hard breakage
Good point! That is a much simpler solution than I had in mind 😄