Updated copy-tracing proposal to handle conflicts
Checklist
If applicable:
- [ ] I have updated CHANGELOG.md
- [ ] I have updated the documentation (README.md, docs/, demos/)
- [ ] I have updated the config schema (cli/src/config-schema.json)
- [ ] I have added tests to cover my changes
I'm getting tired of working on this doc, but I think it's good enough that it's reasonably clear how it will work. So please take a look if you're interested. If you spot problems, it's good to hear about them before I spend time polishing it.
@newren: As the Git rename detection expert, maybe you're interested in the design presented here. Heads up that it's mostly about rename tracking, not detection.
@drieber and I talked about copy tracking in person today. He convinced me that we should not automatically propagate changes to copies and instead leave them conflicted. Then it becomes a simple yes/no question for the user whether to propagate the changes (jj resolve can prompt the user). One benefit of this is that rebasing a modification onto a copy and then back does not result in conflicts. I'll probably update the doc to reflect this next week.
He convinced me that we should not automatically propagate changes to copies and instead leave them conflicted.
TLDR: I also believe this. I think this relates closely to my general worries about tracking copies, and that having a satisfying solution here might not be practical. Do you have something in mind?
Here is how the confusion is arranged in my mind.
I think the kind of conflict we'd get from changes to copies would be a very different kind of conflict to our usual content conflicts with two sides and a base. If you treated it like a normal Merge<Tree> conflict with unchanged tree as base, change to first copy as one side, and change to the other copy as the second side, it would get simplified immediately and both copies would be modified.
So, either we'd have to force the user to resolve the conflict immediately (like Git does for content conflicts), or we'd need to learn to store a whole new kind of conflicts in the repo.
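To make the "simplified immediately" point concrete, here is a minimal sketch of the usual three-way trivial-merge rule. It is not jj's actual `Merge<Tree>` code; the function name `resolve_trivial` and the sample contents are assumptions made up for illustration.

```python
from typing import Optional


def resolve_trivial(base: str, side1: str, side2: str) -> Optional[str]:
    """Return the resolved value of a three-way merge, or None if it conflicts."""
    if side1 == side2:
        return side1  # both sides agree
    if side1 == base:
        return side2  # only side2 changed
    if side2 == base:
        return side1  # only side1 changed
    return None  # both sides changed differently: a real conflict


# Hypothetical contents: the copy is untouched, the original was changed.
unchanged = "original"
changed = "original + change"

# One side still equals the base, so the merge silently takes the change.
copy_merged = resolve_trivial(base=unchanged, side1=unchanged, side2=changed)
```

Because one side equals the base, nothing is left over to store as a conflict, which is why a different representation would be needed to keep the question open for the user.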
One option that comes to mind and seems natural is, in a Tree, for every path, in addition to its blob, we could store an "annotation" that is a set of patches that may or may not need to be applied to this path. Say you have a commit A with file f and then make some new commits:
       change f
    A---------->B
     \       copy f to g and h
      \-------------------->X
Then, if you merge X and B (or rebase X onto B), you'd get an unconflicted Merge<Tree> (that is, only one tree), where both g and h (and maybe f as well) in this tree would have their content from A, but both of them would also have the annotation that the patch "change f" might need to be applied to it. When presenting this to the user, we would present it as a conflict.
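A rough sketch of what such an annotated tree could look like, assuming a tree is a mapping from path to entry. All names here (`Patch`, `Entry`, `annotate_copies`, `is_conflicted`) are hypothetical and not part of jj; the handling of `f` itself is deliberately left out because the comment leaves it open.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Patch:
    """Hypothetical identifier for a change such as "change f"."""
    description: str
    source_path: str


@dataclass
class Entry:
    """A tree entry: content plus patches that may or may not need applying."""
    content: str
    pending: List[Patch] = field(default_factory=list)


def annotate_copies(tree: Dict[str, str], copies: Dict[str, str], patch: Patch) -> Dict[str, Entry]:
    """Keep each file's content, but annotate every copy of the patched source."""
    result = {}
    for path, content in tree.items():
        entry = Entry(content)
        if copies.get(path) == patch.source_path:
            entry.pending.append(patch)
        result[path] = entry
    return result


def is_conflicted(entry: Entry) -> bool:
    """An entry with pending patches would be presented to the user as a conflict."""
    return bool(entry.pending)


# Tree of X: g and h are copies of f, all with the content from A.
tree_x = {"f": "content from A", "g": "content from A", "h": "content from A"}
copies_in_x = {"g": "f", "h": "f"}  # destination -> source
merged = annotate_copies(tree_x, copies_in_x, Patch("change f", "f"))
```

In this sketch `g` and `h` keep their content from A and each carries the "change f" annotation, so both would show up as conflicted until the user decides whether to apply it.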
I'm not fully sure whether there's a sensible algebra for these kinds of annotations, but hopefully? I am apprehensive that it might require building a whole new patch-based VCS on top of jj, since these "annotations" might need to compose. E.g. if you have A --> B --> C where both B and C change f, rebasing X onto C should result in the same annotations as if you rebased it onto B first, and then rebased the result onto C.
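The composition requirement can be written down as a property to check. The sketch below assumes the simplest possible model, where annotations are an ordered tuple of patch names and rebasing appends the patches between the old and new parent; real patches may well not compose this cleanly, and `rebase_annotations` is a made-up name.

```python
from typing import Tuple

Annotations = Tuple[str, ...]


def rebase_annotations(current: Annotations, new_patches: Annotations) -> Annotations:
    """Append the patches introduced between the old and the new parent."""
    return current + new_patches


# Hypothetical patch names for A --> B --> C, where both B and C change f.
a_to_b: Annotations = ("change f in B",)
b_to_c: Annotations = ("change f in C",)

# Rebase X straight onto C.
direct = rebase_annotations((), a_to_b + b_to_c)

# Rebase X onto B first, then rebase the result onto C.
stepwise = rebase_annotations(rebase_annotations((), a_to_b), b_to_c)
```

With plain concatenation `direct` and `stepwise` are equal by construction; the open question in the comment is whether any richer annotation model can keep that equality.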
Aside: Building a combined snapshot + patch-based VCS would be cool, would probably solve this problem (update: this is a hunch; my guess is that patch theory conflicts and copy-tracking conflicts could be unified in one theory), and I've been excited about the possibility for a while. However, it's far from obvious how to marry the two models together into one coherent picture, or whether it's even doable. It's even less clear whether it's doable in a performant way. So, I'm hoping this is not necessary for copy-tracking, though I'm not sure.
I also wonder what "unsatisfying" but less involved solutions to this would look like.
One option (!) might be to restrict what people can do with revisions that have this kind of conflict, e.g. forbid rebasing them in ways that would affect the problematic files. This might be better than giving up on rebasing back and forth being a no-op (which would probably also mess up associativity of rebasing further). I wonder if this is practical, for example whether we could still allow commits to be rebased onto the problematic commits (so that you could always rebase all descendants of a commit).
Another seeming solution would be to just say that, after all, we should automatically propagate changes to copies. I don't believe this solves the problem either; I am looking at my old notes to remember why. Update: See https://github.com/jj-vcs/jj/pull/4988#discussion_r2065374391 below. That doesn't say that we couldn't automatically propagate changes to copies, but it does say that we'd still need additional kinds of conflicts if we wanted "rebase, then rebase back" to result in the same commit.
He convinced me that we should not automatically propagate changes to copies and instead leave them conflicted. Then it becomes a simple yes/no question for the user whether to propagate the changes (jj resolve can prompt the user).
I'm struggling to see how the user would know whether to propagate the changes, especially if there is a lot of history. Unless there are restrictions on copies, I think you have to base the design on the most complicated scenarios, like circular references (copy one to two, make a change, copy two to one, rebase any of those), or a series of copies (copy one to two, copy two to three, copy three to four...).
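To illustrate the two scenarios named here, the sketch below assumes copies are recorded as plain `(source, destination)` path pairs, which is a simplification and not the copy graph from the design doc. The function name `transitive_copies` is hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def transitive_copies(copy_records: Iterable[Tuple[str, str]], source: str) -> List[str]:
    """List every path derived from `source` through any number of copies."""
    children = defaultdict(list)
    for src, dst in copy_records:
        children[src].append(dst)

    seen = {source}  # guards against circular copies
    found = []
    queue = deque([source])
    while queue:
        path = queue.popleft()
        for dst in children[path]:
            if dst not in seen:
                seen.add(dst)
                found.append(dst)
                queue.append(dst)
    return found


# A series of copies: one -> two -> three -> four.
chain = [("one", "two"), ("two", "three"), ("three", "four")]
# A circular reference: one -> two, then two -> one.
cycle = [("one", "two"), ("two", "one")]

chain_result = transitive_copies(chain, "one")
cycle_result = transitive_copies(cycle, "one")
```

Enumerating the affected paths is the easy part; the concern in the comment is that a long chain gives the user many yes/no decisions with little context for each.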
The more I think about it, the more I think a command to copy files doesn't belong in jj. I can't think of one benefit of having it, and it certainly makes a lot of complications. I say leave the editing and file commands to external programs, and let the VCS record the snapshots. (I think of fix this way also.)
@ilyagr: I've made the updates to say that we don't automatically propagate changes to copies. Do you think this is good enough to merge? I would very much like it to be merged because I'm tired of updating it :) Once merged, it's going to be easier for e.g. @steadmon and @jonathantanmy to make further edits.
I'm struggling to see how the user would know whether to propagate the changes, especially if there is a lot of history
I'm also a bit worried about this, but I think we'll have some freedom to experiment with what cases, exactly, we call a conflict. We could also provide quick and automatic ways to resolve such conflicts where they make sense.
(Non-actionable thoughts)
Thinking about this leads me to a philosophical issue that bothers me a bit, where a user might want to be notified about potential copies even if there is no rebase going on. I wonder whether a VCS could or should help with that, it's unclear.
Let's say I'm working on a bugfix to file foo in commit A while somebody from another department of my company copied foo to bar in their codebase (which I'm unaware of) in commit X.
Case 1, good: Now, if I started and finished my work before X happened, the other department will get the bugfix.
Case 2, good: If I started my work before X happened and merged to main after X happened, I would presumably get a conflict when I rebase A onto main. This seems good, I can reason about whether the other department needs my bugfix.
Case 3, not so good: However, if I started my work after X happened, there is no rebase and there is no conflict, the other department just doesn't get my bugfix at all.
Is there some way a VCS could help and notify me about potential other files I'd want to clone my changes to?
OTOH, similarly to Joy's original concern, there are probably cases where notifications for Case 2 or Case 3 would become noisy and unhelpful, so we'd have to experiment with that as well.
Thanks for reviewing! I'll try to update the doc within the next few days.
Is there some way a VCS could help and notify me about potential other files I'd want to clone my changes to?
It shouldn't be hard to provide a feature which tells you which files have been copied from a given file. We could also provide a jj recopy command to copy over the change that happened to the source file since it was copied into the destination file. However, the model proposed in this document doesn't provide a way to record that the destination is now in sync with the source file again, so we cannot figure out what a subsequent jj recopy should do. For that to work, I suppose we would have to record not only the copy source in the copy graph node but also the FileId of the copy source. I have not thought about what the consequences of that would be.
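A small sketch of the idea in the last two sentences: if the copy record also stores an identifier of the source content at the time of the last sync, a later `jj recopy` (a command proposed in this thread, not an existing one) could tell whether there is anything new to copy over. `CopyRecord`, `needs_recopy`, `mark_synced`, and the id strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRecord:
    """Hypothetical copy-graph node payload for a copied file."""
    source_path: str
    source_file_id: str  # id of the source content when last in sync


def needs_recopy(record: CopyRecord, current_source_file_id: str) -> bool:
    """True if the source changed since the destination was last synced."""
    return record.source_file_id != current_source_file_id


def mark_synced(record: CopyRecord, current_source_file_id: str) -> CopyRecord:
    """Record that the destination is in sync with the source again."""
    return CopyRecord(record.source_path, current_source_file_id)


# bar was copied from foo when foo had the (made-up) id "id-1".
record = CopyRecord(source_path="foo", source_file_id="id-1")

stale = needs_recopy(record, "id-2")       # foo has changed since the copy
synced = mark_synced(record, "id-2")       # after a recopy
still_stale = needs_recopy(synced, "id-2")
```

This only shows why the extra field makes a second recopy well defined; it says nothing about the consequences for the rest of the model, which the comment leaves unexplored.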
Case 2, good: If I started my work before X happened and merged to main after X happened, I would presumably get a conflict when I rebase A onto main. This seems good, I can reason about whether the other department needs my bugfix.
I don't agree that you could reason about what the other department needs. I was reading this doc to mean that the copy was traced, not the original. Perhaps this is semantics, but just because someone copied my file shouldn't obligate me to fix their subsequent problems. Rebasing A shouldn't be a conflict. I can see how the person copying might want to know, but I think there are more cases when they wouldn't want to know (a folder of templates everyone starts from).
I'm still waiting to hear a case where tracing a copy is a thing the VCS should do.
@joyously Hey I haven't thought about this deeply to the level of writing a design doc, but I do think VCS should track copies and moves.
Here's my 2 minute case for it: Often copies and moves are meant for uncomplicating things, reducing complexity, reducing the burden on the reader of the code in trying to understand it. Features like blame and commit log are crucial in being able to understand why the code is the way it is, in what context a line of code was written, etc. If renaming (moving) a file breaks the commit log (also breaking blame as a result), then it's a huge setback in providing the context behind the state of the code to the reader. This is counterproductive to the very thing that the coder was trying to do by renaming the file: simplify and organize things into being more comprehensible.
Same thing applies to copy mutation (sometimes even more acutely). One common case of refactoring is when a file of code keeps blowing up and the file is just too big, so now I want to split it into multiple smaller files, again with the purpose of simplifying things and making it easier for the reader to understand the code. But it is absolutely crucial that the code doesn't lose its history of when, how and why it was checked in, in the first place. The best way I've found to preserve this information is to track the file across copies. With hg, I'll often duplicate a file and remove a portion of the code from it, and from the other copy, I'll remove all the code except the portion that I removed from the other file. I've effectively split the file into two without adding any lines of code (except for some boilerplate to preserve the scope of the code). Because hg tracks the copies, I can still see the original commits when each line of code was checked in, in BOTH splits of the file...
I hope I've been able to illustrate the use cases and provide what might be a first case for copy-tracing in a VCS to you.
Cheers. PS: Sorry if I misunderstood this conversation that I just jumped in the middle of.
Edit: I for one, have been desperately waiting for this feature to be included in jj
Another Edit: Ultimately what matters to me is that when I copy or rename files, if I try to see the commit log of that file or see the blame annotations, then all the commits show up appropriately in both those views, even the commits from before I copied or renamed or moved the file.
@spundun Thanks for the use case for tracking a copy. (I already agree that renames should be tracked.) I don't think your use case is a good use of copy; it seems that tracking a split would be better. The discussion above was about the VCS showing conflicts in copies, and making the same changes in copies as in the original, when a rebase happens. This is the part where I think the user can't possibly know whether it should, since it can get really complicated having to know all the history of the code.
If the tracking of copy was only for being able to show blame correctly, that seems okay, although it seems that blame could show the fact of the copy and that would be sufficient.