jj FR: better detection of commits that were made empty

Is your feature request related to a problem? Please describe. i have a local change that was merged into main as 52f4fb1b27da2874155062c0c3ccfa67cd0fe291. i still have that change locally, and jj does not think it is a parent of main@origin (cc https://github.com/martinvonz/jj/issues/2978). if i try to rebase it with jj rebase -d main@origin unqytsx, i get a merge conflict:

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4884a7383a...0000000000 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -87,12 +87,18 @@
 
 * `jj workspace root` was aliased to `jj root`, for ease of discoverability
 
+<<<<<<<
++++++++
 * `jj diff` no longer shows the contents of binary files.
 
 * `jj git` now has an `init` command that initializes a git backed repo.
 
 * New template function `surround(prefix, suffix, content)` is added.
 
+%%%%%%%
++* `jj diff` no longer shows the contents of binary files.
++
+>>>>>>>

this is very silly; the line it's trying to add is already present in the file, and it was added before the "jj git" line (before in history, the fact it's the line above shouldn't be relevant). so jj should notice that it's empty instead of thinking it's a conflict.

Describe the solution you'd like jj rebase should notice the change is now empty and i should end up with a commit with the same description as unqytsx but an empty diff. ideally that change itself should then be dropped (https://github.com/martinvonz/jj/issues/229), but that's separate from this issue.

Describe alternatives you've considered

Additional context i am pretty sure this "just works" with git, although i admittedly haven't tested.

Feb 07 '24 04:02 jyn514

I think you're right that git does this better. It calculates a "patch id" (basically a hash of a context-less diff output, IIRC) for the commits to rebase and for the upstream commits. Then it skips rebasing the commits that match a patch in upstream. I have been thinking of doing something similar in the planned jj git sync. Maybe we should add an option for that to jj rebase too.

Feb 07 '24 06:02 martinvonz

sounds great :)

I have been thinking of doing something similar in the planned jj git sync. Maybe we should add an option for that to jj rebase too.

slightly offtopic, but i'm strongly in favor of not blocking this on a new jj sync command. i've seen a couple different comments along the lines of "this looks cool, we'll put it in jj sync" and it reminds me of other projects where work has been delayed for months or years because it's part of a larger feature that never gets implemented.

that's not to say i don't think jj sync will get implemented, but i think it would benefit greatly from being scoped down, at least at first.

Feb 09 '24 22:02 jyn514

My ideal solution would be to create a conflict theory that can keep track of this, so that all conflicts are simplified better. This may be easier said than done, so jyn's remark about scoping down applies to this idea as well.

Feb 10 '24 06:02 ilyagr

I think you're right that git does this better. It calculates a "patch id" (basically a hash of a context-less diff output, IIRC) for the commits to rebase and for the upstream commits. Then it skips rebasing the commits that match a patch in upstream. I have been thinking of doing something similar in the planned jj git sync. Maybe we should add an option for that to jj rebase too.

This also sounds like this would address behavior I'm seeing:

My nixpkgs custom branch has empty commits as markers to separate chunks of commits (some are already upstream, some are personal patches meant to stay that way, some are specifically cross-build fixes, etc)
Today I rebased and some of the commits in my "upstream" section turned empty as they had landed in the branch that I rebase onto.
So I re-ran the rebase with --skip-empty, but of course this removed my nice spacer commits as well.

Also, I didn't see it mentioned, but if this smarter "skip commit already in rebase target (?)" logic is implemented, it would be nice if it were the default behavior.

Feb 12 '24 06:02 colemickens

I've been thinking about this for a while since I've run into this a few times, and I wonder if jj can use a simpler approach than Git. The algorithm I'm thinking of is this:

If commits S are being rebased onto destination D:

Find commits which are present in ::D but not in ::S. These are commits that might have duplicates in S.
Build a HashMap<Signature, Vec<CommitId>> based on the author signature for each commit. Since rebased commits should have the same author and author timestamp as the original commit, we should only have to compare commits with the same author signature.
Whenever a commit X from S is rewritten, check whether its author signature is in the HashMap. If it isn't, X should be kept.
If the author signature was present in the map, then filter the Vec<CommitId> based on the description of the commit. If no commits have the same description as X, then X should be kept. (This step may not be necessary, but I think it would make it more efficient if there are several commits with the same author timestamp for some reason. We might need to be careful about trailers being added though, so I'm not sure if it's safe to filter by description.)
For each remaining commit Y with the same author signature and description as X, compute the tree which would result from rebasing X onto Y-. If the resulting tree is the same as Y, then X can be discarded since it's a duplicate of Y.

This seems simpler to me since the patch IDs wouldn't need to be computed for every commit, which could be expensive, and we wouldn't need to cache the patch IDs anywhere. It does rely on the assumption that the author and author timestamp are the same after rebase, but that feels pretty safe to me.

Jul 29 '24 00:07 scott2000

It might be simpler to just rely on patch id (or change id if the backend or forge natively supports it.) Computing hash of diffs wouldn't be expensive compared to actual rebasing. We can of course add some heuristics based on signature, though.

Jul 29 '24 10:07 yuja

The main reason I was thinking that actually checking the rebased trees match could be better is that it might be correct in more scenarios. Specifically if there's a situation where A - B + A is simplified to A in a hunk of one of the files, that would cause the patch ID to be different after rebasing (if I'm understanding it correctly), so the patch ID approach would think the commits are different when they actually are the same. However if Git doesn't handle this case either, then maybe it's not common enough to matter.

Jul 29 '24 12:07 scott2000

please do not base this off author + commit message. i would like to be able to rebase other people's branches, and i would like to edit my commit messages, and breaking a core feature like this when that happens is very unexpected.

Jul 29 '24 14:07 jyn514