Mandatory commit maps, enabling squash-merge (meld with "subcommit" ideas)
I was thinking about the idea of caching a separate map between Main commits and Sub commits (#3). git push and git fetch can be used to share any ref under refs/ created by git update-ref (which is how sharing Git Notes works, for example), so we might be able to use that to share the map between Main commits and Sub commits. Along the way I might have stumbled on an idea to merge my subcommit ideas into this.
I've confirmed that refs can point to trees or blobs, and that git push --force-with-lease=<refname>:<sha> works with them (other forms of --force-with-lease don't, though). --force-with-lease has been in Git since 1.8.5, so I'm pretty sure this can work.
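As a rough sketch of what such a push could look like, here is a small helper that builds the git push command line. The ref name refs/subhistory/commit-map and the function name are made up for illustration; only the --force-with-lease=<refname>:<sha> form comes from the note above.

```python
def commit_map_push_argv(remote, refname, expected_remote_sha):
    """Build a `git push` argv that overwrites `refname` on `remote` only if
    the remote ref still points at `expected_remote_sha` (the lease)."""
    # The lease names the ref explicitly, which is the form reported above
    # to work for refs that point at trees or blobs.
    lease = f"--force-with-lease={refname}:{expected_remote_sha}"
    # Push the local ref to the same name on the remote.
    refspec = f"{refname}:{refname}"
    return ["git", "push", lease, remote, refspec]


# Hypothetical ref name; the SHA is the one the remote had at the last fetch.
argv = commit_map_push_argv(
    "origin",
    "refs/subhistory/commit-map",
    "da39a3ee5e6b4b0d3255bfef95601890afd80709",
)
print(" ".join(argv))
```

The resulting list could be handed to subprocess.run; if someone else updated the map in the meantime, the lease no longer matches and the push is rejected instead of clobbering their entries.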
So my line of thinking went something like this. We wouldn't want git subhistory split/merge/push/pull/what-have-you to be crazy slow the first time after cloning a big repo, so we would encourage/require people to push and fetch these maps. But if everyone's using these commit-to-commit maps, then there's no reason the underlying contents of the commits have to correspond as perfectly as subhistory is currently designed around.
Marking commits
In particular, squash-merging could totally work! For illustrative purposes, suppose that the 3rd commit on master to modify Sub comes before merging:
[HEAD]
[initial commit] [master]
o--------------------------o--------------------------o--------------------------o--------------------------o
Add a Main thing Add a Sub thing Add 2nd Sub thing Add 2nd Main thing Add 3rd Sub thing
_____________________ _____________________ _____________________ _____________________ _____________________
| | | | | | | | | |
| Files: | | Files: | | Files: | | Files: | | Files: |
| + a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing |
| | | + path/to/sub/ | | path/to/sub/ | | + 2nd-Main-thing | | 2nd-Main-thing |
| | | + a-Sub-thing | | a-Sub-thing | | path/to/sub/ | | path/to/sub/ |
| | | | | + 2nd-Sub-thing | | a-Sub-thing | | a-Sub-thing |
| | | | | | | 2nd-Sub-thing | | 2nd-Sub-thing |
| | | | | | | | | + 3rd-Sub-thing |
|_____________________| |_____________________| |_____________________| |_____________________| |_____________________|
Say we're squash-merging sub-upstream/master into master. As with normal git-subhistory merge, we split the history of Sub in HEAD out as SPLIT_HEAD, but then instead of assimilating the SPLIT_HEAD..sub-upstream/master commits, we first merge sub-upstream/master directly into SPLIT_HEAD:
[initial commit] [SPLIT_HEAD] [sub-upstream/master] [MERGE_HEAD]
o--------------------------o--------------------------o-----------------------------------------------------|--------------------------o
| |\-------------------------|--------------------------o--------------------------o-------------------------/|
Add a Sub thing Add 2nd Sub thing Add 3rd Sub thing Fix Sub somehow Fix Sub some more Merge branch 'sub-upstream/master' into path/to/sub/ subhistory of master
_____________________ _____________________ _____________________ _____________________ _____________________ _____________________
| | | | | | | | | | | |
| Files: | | Files: | | Files: | | Files: | | Files: | | Files: |
| + a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing |
| | | + 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing |
| | | | | + 3rd-Sub-thing | | + fix-Sub | | fix-Sub | | < 3rd-Sub-thing |
| | | | | | | | | + fix-Sub-more | | > fix-Sub |
| | | | | | | | | | | > fix-Sub-more |
|_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________|
And we use that merged Sub tree in a new squash-merge commit on master:
[HEAD]
[initial commit] [master]
o--------------------------o--------------------------o--------------------------o--------------------------o--------------------------o
Add a Main thing Add a Sub thing Add 2nd Sub thing Add 2nd Main thing Add 3rd Sub thing Squash-merge subhistory branch 'sub-upstream/master' under path/to/sub/
_____________________ _____________________ _____________________ _____________________ _____________________ _____________________
| | | | | | | | | | | |
| Files: | | Files: | | Files: | | Files: | | Files: | | Files: |
| + a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing |
| | | + path/to/sub/ | | path/to/sub/ | | + 2nd-Main-thing | | 2nd-Main-thing | | 2nd-Main-thing |
| | | + a-Sub-thing | | a-Sub-thing | | path/to/sub/ | | path/to/sub/ | | path/to/sub/ |
| | | | | + 2nd-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing |
| | | | | | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing |
| | | | | | | | | + 3rd-Sub-thing | | 3rd-Sub-thing |
| | | | | | | | | | | + fix-Sub |
| | | | | | | | | | | + fix-Sub-more |
|_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________|
Note that as far as the rest of Git is concerned, the squash-merge commit is a normal, non-merge commit (with only one parent) that happens to make changes only in path/to/sub/. But to subhistory, it's a commit assimilated from a Sub commit, with an entry in the commit map from the squash-merge commit to the Sub merge commit.
This is important because the squash-merge commit needs to be split out as that Sub merge commit. Suppose, one last time, another (4th) commit on master modifies Sub:
[HEAD]
[initial commit] [master]
o--------------------------o--------------------------o--------------------------o--------------------------o--------------------------o--------------------------o
Add a Main thing Add a Sub thing Add 2nd Sub thing Add 2nd Main thing Add 3rd Sub thing Squash-merge subhistory... Add 4th Sub thing
_____________________ _____________________ _____________________ _____________________ _____________________ _____________________ _____________________
| | | | | | | | | | | | | |
| Files: | | Files: | | Files: | | Files: | | Files: | | Files: | | Files: |
| + a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing | | a-Main-thing |
| | | + path/to/sub/ | | path/to/sub/ | | + 2nd-Main-thing | | 2nd-Main-thing | | 2nd-Main-thing | | 2nd-Main-thing |
| | | + a-Sub-thing | | a-Sub-thing | | path/to/sub/ | | path/to/sub/ | | path/to/sub/ | | path/to/sub/ |
| | | | | + 2nd-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing |
| | | | | | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing |
| | | | | | | | | + 3rd-Sub-thing | | 3rd-Sub-thing | | 3rd-Sub-thing |
| | | | | | | | | | | + fix-Sub | | + 4th-Sub-thing |
| | | | | | | | | | | + fix-Sub-more | | fix-Sub |
| | | | | | | | | | | | | fix-Sub-more |
|_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________|
And then we split that out to push upstream:
[initial commit] [sub-upstream/master] [SPLIT_HEAD]
o--------------------------o--------------------------o-----------------------------------------------------|--------------------------o--------------------------o
| |\-------------------------|--------------------------o--------------------------o-------------------------/|
Add a Sub thing Add 2nd Sub thing Add 3rd Sub thing Fix Sub somehow Fix Sub some more Merge branch 'sub-upstr... Add 4th Sub thing
_____________________ _____________________ _____________________ _____________________ _____________________ _____________________ _____________________
| | | | | | | | | | | | | |
| Files: | | Files: | | Files: | | Files: | | Files: | | Files: | | Files: |
| + a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing | | a-Sub-thing |
| | | + 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing | | 2nd-Sub-thing |
| | | | | + 3rd-Sub-thing | | + fix-Sub | | fix-Sub | | < 3rd-Sub-thing | | 3rd-Sub-thing |
| | | | | | | | | + fix-Sub-more | | > fix-Sub | | + 4th-Sub-thing |
| | | | | | | | | | | > fix-Sub-more | | fix-Sub |
| | | | | | | | | | | | | fix-Sub-more |
|_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________| |_____________________|
It's important that this split-out commit be a fast-forward from the "Merge branch 'sub-upstream/master' into path/to/sub/ subhistory of master" commit, so that it will be a fast-forward from sub-upstream/master; if upstream has further updates, "Fix Sub some more" will be the merge base. If instead the squash-merge commit were split out as something other than that non-squash merge commit, with the "Fix Sub some{how, more}" changes squashed into Sub's history, then "Fix Sub some more" wouldn't be the merge base and the next merge with upstream could well conflict.
One complication is that Git is a distributed systems problem: what if someone else pulls down the squash-merge commit, makes more changes to Sub on top, and then splits it out? As noted above, it's critical that the squash-merge commit be split out as the underlying non-squash Sub merge commit so that the merge base with upstream will be the right one. How do we enforce that the commit map is up-to-date at split time? Ideas:
- Parse the fetch refspecs in git config, ensure there's a refspec to download the refs with the commit maps for...which remote, just origin? What if they didn't use a remote, just passed a Git URL directly to git pull or something? Too many ways to pull in commits without using refspec in config for this to work.
- Add a marker file in the tree with the hash of the subproject commit, like path/to/sub/.gitsubhistory/assimilated-from or something, to tell us to download and use the subproject commit. Problem: subsequent normal commits will have the same file with the same contents unless the user manually changes this file.
- Impose some kind of formatting requirement on the commit message of assimilated commits, and parse that format. This seems like a bad idea, the Merge branch blah blah... commit message format is really verbose and it's likely there are people who prefer to customize those to be more readable; GitHub overrides that, for example. It would be less bad if the requirement is merely "last line must be of the form Assimilated from da39a3ee5e6b4b0d3255bfef95601890afd80709." or something, but still, it feels like if the user manually edits the commit message and like, misspells "Assimilate" or something, that shouldn't break subhistory, that would be stupid.
- Add a custom header to assimilated commits, like:

      tree 8d640c644213d7e508971236aaeda72ea1b1a509
      parent f45ffa8f782e7263702846facac99498788e6ce8
      author Han Seoul-Oh <[email protected]> 1483412929 -0500
      committer Han Seoul-Oh <[email protected]> 1483412929 -0500
      +subhistory Sub da39a3ee5e6b4b0d3255bfef95601890afd80709

      Subject line

      Commit message body

  Supposedly "since we introduced the "encoding" header a while back clients have learned to ignore unknown headers", so this shouldn't break anything. However, I did some testing, and the unknown header does get thrown away by git cherry-pick always, and by git rebase -i if any earlier commit is changed (probably because it runs git commit-tree which generates the commit from scratch). git commit --amend does preserve the header though, as does git rebase -i if nothing earlier changes (i.e., if the parent is the same hash). Also, this would obviously be more annoying to generate and parse than the commit message.
So, the squash-merge commit object itself must somehow be marked with the Sub merge commit to tell us to download and use it; the commit map alone is insufficient.
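To make the custom-header option a bit more concrete, here is a minimal sketch of reading such a marker back out of a raw commit object (the text printed by git cat-file commit <sha>). It assumes the header is named subhistory and carries a subproject name followed by a hash, as in the example above; the function names and the sample author line are invented for illustration.

```python
def parse_commit_headers(raw):
    """Split a raw commit object into ([(key, value), ...], message)."""
    head, _, message = raw.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        if line.startswith(" ") and headers:
            # Continuation of a multi-line header such as gpgsig.
            key, value = headers[-1]
            headers[-1] = (key, value + "\n" + line[1:])
        else:
            key, _, value = line.partition(" ")
            headers.append((key, value))
    return headers, message


def assimilated_from(raw, header="subhistory"):
    """Return {subproject name: Sub commit hash} from the marker headers."""
    headers, _ = parse_commit_headers(raw)
    marks = {}
    for key, value in headers:
        if key == header:
            name, _, sha = value.partition(" ")
            marks[name] = sha
    return marks


# Sample commit object; the author/committer identity is a placeholder.
raw_commit = (
    "tree 8d640c644213d7e508971236aaeda72ea1b1a509\n"
    "parent f45ffa8f782e7263702846facac99498788e6ce8\n"
    "author A U Thor <author@example.com> 1483412929 -0500\n"
    "committer A U Thor <author@example.com> 1483412929 -0500\n"
    "subhistory Sub da39a3ee5e6b4b0d3255bfef95601890afd80709\n"
    "\n"
    "Subject line\n"
    "\n"
    "Commit message body\n"
)
print(assimilated_from(raw_commit))
```

A commit without the header yields an empty dict, which a split could treat as "no Sub commit to download, create it from the commit object as usual".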
This should work for empty commits (#6), too. Open question: should we do this for every assimilated commit, then, not just squash-merge commits and assimilated empty commits? (If we do it for all assimilated commits, we wouldn't even need that direction of cache map, right? And it would take care of transformed commit messages.)
Invariants
Another natural question is whether we should symmetrically be marking split-out commits too, but I think the answer to that is a definitive no. They're fundamentally asymmetrical: a given commit of Main has some fixed number of subprojects in it, whereas a given commit of Sub could be assimilated into any number of superprojects in the future. It would be weird for a superproject assimilating a Sub commit to have information on the hash of a commit in some other unrelated superproject (in the marking of the split-out commit).
And how would it be useful? Having the split-out commit have a transformed commit message (subcomponent prefix removed, for example)? So, what, next time we split out the Main commit C, we check to see if there's already a split-out Sub commit C' with a marking pointing back to C? Remember, distributed systems problem: what if someone else downloads C but hasn't downloaded C', when they split out C will they get a different hash from C'?
This is a fundamental thing that subhistory needs to satisfy, which leads to a fundamental invariant:
- When splitting out a Main commit at a particular path, there must be a unique Sub commit that we know how to get to from the commit object alone.
Note that the current guarantee is stronger than this, where there's a unique Sub commit that we're able to actually create from the commit object alone. This proposal weakens that guarantee: we may have to download a ref to the Sub commit, because a squash-merge commit just doesn't have enough information. But we know from the commit object alone (due to the marking) that we need to download that ref.
Problems:
- if two people modify commit map, how to sync? refspecs can only force-push/pull; pre-push hook that does disown (to daemonize) and then pushes refs to subcommits?
- how to get tips of all subcommits ever split out or something? Or just branches, or what?
- if I push assimilated commit but fail to push commit-map and/or refs to subcommit, it doesn't get noticed until someone pulls and tries to split (or do something that triggers split like assimilate), and that person can't do anything about it, they have to go find the person who pushed and complain to them, unless we have the current strong requirement (can make unique Sub commit from Main commit object alone)
I was editing that over the course of many days, it's time to push it even though I have further thoughts.
> if two people modify commit map, how to sync? refspecs can only force-push/pull
My first instinct is for a post-fetch hook to merge the remote commit map with the local one. Unfortunately, [there's no post-fetch hook], but inspired by that link I realized that merging commit maps doesn't actually need to happen immediately after fetch as long as it happens before the next push, or before the next split with a fetched commit. It's trivial for split to merge in the remote commit map as a first (zeroth?) step, and if there's no split before the next push, there's a pre-push hook (which unfortunately does get skipped if you do git push --no-verify, but it's the best we can do).
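A minimal sketch of what merging two commit maps could mean, assuming a map is just a dict from Main commit hash to Sub commit hash (the real storage format isn't decided anywhere in this thread). Since a Main commit is supposed to correspond to a unique Sub commit, an entry that differs between the two sides is reported as a conflict rather than silently overwritten.

```python
def merge_commit_maps(local, remote):
    """Union two {main_sha: sub_sha} maps.

    Returns (merged, conflicts), where conflicts maps a Main commit hash to
    the pair (local_sub_sha, remote_sub_sha) when the two sides disagree.
    """
    merged = dict(remote)
    conflicts = {}
    for main_sha, sub_sha in local.items():
        if main_sha in merged and merged[main_sha] != sub_sha:
            # Both sides mapped the same Main commit to different Sub commits.
            conflicts[main_sha] = (sub_sha, merged[main_sha])
        else:
            merged[main_sha] = sub_sha
    return merged, conflicts


merged, conflicts = merge_commit_maps(
    {"main-a": "sub-1", "main-b": "sub-2"},
    {"main-b": "sub-2", "main-c": "sub-3"},
)
print(merged, conflicts)
```

The short strings stand in for real 40-character hashes. What to do about a conflict (abort, prefer one side, re-split) is left open here.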
Just checked and the pre-push script can fetch and push no problem (was worried that before pre-push was invoked, git-push acquired a lock on pushing or something). So it could totally work to, for example,
1. Based on objects being pushed to remote passed to pre-push, determine which subhistory objects need to be pushed (e.g. if a squash-merge commit is being pushed, identify the Sub commits)
2. Merge local commit map with last fetched remote commit map
3. git push --force-with-lease commit map and refs to Sub commits
4. if that failed, fetch remote commit map and go back to step 2 (error if we've hit this loop too many times)
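The merge/push/retry part of those steps could be sketched roughly like this. The git operations are passed in as plain callables so the control flow stands on its own; fetch_remote, push_with_lease, and the (map, sha) pair representation are assumptions for illustration, not existing subhistory functions, and the merge is a naive dict union.

```python
def sync_commit_map(local_map, last_fetched, fetch_remote, push_with_lease,
                    max_attempts=5):
    """Merge and push the commit map, retrying when the lease is rejected.

    last_fetched:    (remote_map, remote_sha) as of the last fetch
    fetch_remote:    () -> (remote_map, remote_sha), freshly fetched
    push_with_lease: (new_map, expected_remote_sha) -> True if accepted
    """
    remote_map, remote_sha = last_fetched
    for _ in range(max_attempts):
        # Step 2: merge local entries with the last known remote entries.
        merged = {**remote_map, **local_map}
        # Step 3: push, conditional on the remote being unchanged.
        if push_with_lease(merged, remote_sha):
            return merged
        # Step 4: someone else pushed first; refetch and go around again.
        remote_map, remote_sha = fetch_remote()
    raise RuntimeError("commit map push kept losing the lease; giving up")


# Tiny in-memory stand-in for a remote that moved on since our last fetch.
remote = {"map": {"main-1": "sub-1"}, "sha": "v2"}

def fake_fetch():
    return dict(remote["map"]), remote["sha"]

def fake_push(new_map, expected_sha):
    if expected_sha != remote["sha"]:
        return False
    remote["map"], remote["sha"] = dict(new_map), "v3"
    return True

result = sync_commit_map({"main-2": "sub-2"}, ({}, "v1"), fake_fetch, fake_push)
print(result)
```

In the fake run, the first push is rejected because the remote is at "v2" rather than the stale "v1", and the second attempt succeeds after refetching.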
Further notes:
- a post-rewrite hook exists that can help deal with rebasing assimilated or squash-merge commits
- there's no hook for deleting branches, but a pre-auto-gc hook does exist that could go and delete refs to subcommits that are otherwise unreachable (how? Would it have to sweep all branches and tags? That would be unfortunate. Also concurrency safety would be another concern, if it's updating some kind of thing with the tips of the subhistory of every branch or something. Bonus of such a thing is its reflog would reference (and hence render reachable) the subcommits of any Main commits that are unreachable from the main branches except via reflog)
- one interesting thought: if we preclude squash-merging, not only is it nice that the commit map and refs to subcommits are only an optimization and therefore we never have to worry about losing data, but we don't even need the refs, every subcommit can be created in O(1) from the Main commit and the map, as long as we can map the hashes of the parent commits to the parent subcommits
Wow, man, I did a first read of this this evening and I have to say, including the info in a custom header seems to me like a really clever thought! Also because my philosophy is "subhistory is fine, if the user is doing a 'nonstandard' thing, then calling full split again is the price for it", so maybe there is no need to worry about everything (for example, cherry-picking can in practice be used for supporting LTS versions...)
Also, a question did arise: does it work also with the option to not do squashing?
So this is for the first thoughts - I will read it again and have comments :)
> does it work also with the option to not do squashing?
Yes! It might even be easier if I preclude squashing (although maybe not, since the problem of determining which Sub commits to keep refs to seems more or less the same as the problem of determining which commits to keep in the commit map)
Wow, just noticed #7, I hadn't thought about signatures at all but that suggests another transformation of commits when assimilating that it is important to undo when splitting out assimilated commits: stripping signatures. (That is, assimilating commits strips signatures; when splitting them out later, it's important to map back to the signed Sub commits.)
So that's a point in favor of adding the custom header or special commit message line to every assimilated commit, or at least certainly every signed one.
Note that if we keep around the commit map and signatures, we still don't need to keep around refs to the actual commits, that's enough information to recreate the Sub commit objects.
Cool, thanks for answer!
So, for the 'custom header' option: is there a way to then see whether a) everything is fine and we can use shortcuts (the maps, as I think you refer to them), or b) something is messed up (cherry-pick, rebasing, amending, ...) and the price is then to split it again, to compensate for the nonstandard behavior...?
My PR does that by simply comparing references and their histories, but for this clever strategy, how would the mechanism work?