jj icon indicating copy to clipboard operation
jj copied to clipboard

jj creates divergent operations without any user input whatsoever

Open matts1 opened this issue 5 months ago • 21 comments

Description

jj creates divergent operations (and changes) when just typing into the editor - no commands required to run.

I have a theory on what the problem is (see actual behaviour section)

Steps to Reproduce the Problem

No commands involved, but I believe this is due to core.watchman.register-snapshot-trigger = true. I suspect the feature has race conditions that trigger when:

  • You're on a large repo (thus the snapshot takes longer - ~1 second for me)
  • You're using an editor that automatically saves files for you very frequently (I'm using vscode)

I get divergent changes in my jj log arbitrarily, after not having run any jj commands.

My op log contains:

@  cbda09db69fd [email protected] 1 minute ago, lasted 1 second
│  snapshot working copy
│  args: jj debug snapshot
○    4bf1c16327ae [email protected] 1 minute ago, lasted 965 milliseconds
├─╮  reconcile divergent operations
│ │  args: jj debug snapshot
○ │    f6ff969db703 [email protected] 1 minute ago, lasted 932 milliseconds
├───╮  reconcile divergent operations
│ │ │  args: jj debug snapshot
○ │ │  9594d9b3678a [email protected] 1 minute ago, lasted 1 second
│ │ │  snapshot working copy
│ │ │  args: jj debug snapshot
│ ○ │  6e6f484ee8d9 [email protected] 1 minute ago, lasted 1 second
│ ├─╯  snapshot working copy
│ │    args: jj debug snapshot
│ ○  2a563ca5d0de [email protected] 1 minute ago, lasted 1 second
├─╯  snapshot working copy
│    args: jj debug snapshot
○  680b07ec0f8d [email protected] 4 minutes ago, lasted 1 second
│  snapshot working copy
│  args: jj debug snapshot
○  88e1d6aa889c [email protected] 4 minutes ago, lasted 1 second
│  snapshot working copy
│  args: jj debug snapshot
...

Expected Behaviour

I expect to not have divergent changes

Actual Behaviour

jj creates two divergent changes. My theory is that:

  1. the user starts typing "ab". Vscode saves mid-sentence with only the content "a"
  2. Watchman sees a filesystem change and triggers a snapshot operation
  3. jj starts to create a new snapshot operation 2a56 based off the current op head (680b in the example above)
  4. jj reads the content "a" to the snappshot 2a56
  5. vscode writes "ab" to the file, triggering watchman's snapshot operation again
  6. jj starts to create a new snapshot operation 9594 based off the current op head (680b in the example above)
  7. jj writes the content "ab" to the snapshot 9594
  8. Both operations complete, creating divergent changes
  9. Divergent changes merge

If this theory is correct, then the solution I would propose is to, after completing a snapshot operation from watchman, check if the op head has changed, and if it has, you discard the current operation instead of attempting to reconcile it.

Specifications

  • Platform: Linux
  • Version: 0.29.0

matts1 avatar Jun 26 '25 00:06 matts1

I think @ilyagr also observed something similar when he used a older version of VisualJJ. We also should forward the env correctly to the watchman trigger as noted by @emilazy some time ago now.

PhilipMetzger avatar Jun 26 '25 16:06 PhilipMetzger

https://github.com/jj-vcs/jj/issues/3747 might be related.

ilyagr avatar Jun 26 '25 20:06 ilyagr

This was my experience with VisualJJ as well, and what caused me to remove it. I tried reproducing it without VisualJJ by interleaving actions manually, but was unable to do it and wrote it off as something about VisualJJ's inotify timing. But I didn't realize VSCode does frequent writes itself; that makes the above scenario seem very plausible.

hotsphink avatar Jun 26 '25 20:06 hotsphink

Background snapshotting processes like VisualJJ or watchman.register-snapshot-trigger inevitably have this problem. To mitigate this class of problems, I think we'll need a wider lock to serialize local operations. This lock will cover the whole snapshot->update_repo->publish_op scope. Since this lock doesn't affect the correctness (= divergence is valid outcome if you ran two jj commands concurrently), it's probably good to ignore/override stale lock file left by crashed jj process.

yuja avatar Jun 27 '25 01:06 yuja

I wonder if this is the same issue or a separate one. (I'm going back and forth on this)

I start with (Update: Expanded this example to show the gh-readonly-queue commits I already had)

$ jj log --at-op 3ebbe -r 'uxlp-..igtrunk' --no-pager
○  wyskno [email protected] 55 minutes ago gh-readonly-queue/main/pr-6836-5bfa79235d513cd885fb7b4d9f2a597e3e5b9a42@upstream a864eec
│  templater: make Operation object serializable
○  rsupsy [email protected] 55 minutes ago a3b0301
│  op_store: pack start/end_time into TimestampRange struct
○  qpwmpp [email protected] 55 minutes ago a8e5a5b
│  op_store: move TimestampRange type from templater
○  kmqumu [email protected] 55 minutes ago 39119ba
│  templater: inline TimestampRange::duration() -> String
○  tqylsw [email protected] 55 minutes ago gh-readonly-queue/main/pr-6835-3d3159ecd19fd5467d266235180729fd8364b1db@upstream 5bfa792
│  templater: make Commit object serializable
○  syokqt [email protected] 55 minutes ago 8adf0f6
│  templater: make CommitRef object serializable
○  zvtwss [email protected] 55 minutes ago f5424fa
│  merge: derive serde::Serialize for RefTarget
◆  uxlpvy [email protected] 1 hour ago main* 3d3159e
│  templater: make some more value types serializable
~

I then do jj git fetch and end up with the following:

Op log
$ jj op log -d
@    c2a9f7c31124 [email protected] 4 minutes ago, lasted 21 milliseconds
├─╮  reconcile divergent operations
│ │  args: jj log -r oldigtrunk..igtrunk
○ │  efe5291073f7 [email protected] 4 minutes ago, lasted 8 milliseconds
│ │  import git refs
│ │  args: jj op log --color never --quiet --no-graph --limit 1 --template id
│ │
│ │  Changed commits:
│ │  ○  - wyskno hidden a864eec templater: make Operation object serializable
│ │  ○  - rsupsy hidden a3b0301 op_store: pack start/end_time into TimestampRange struct
│ │  ○  - qpwmpp hidden a8e5a5b op_store: move TimestampRange type from templater
│ │  ○  - kmqumu hidden 39119ba templater: inline TimestampRange::duration() -> String
│ │  ○  - tqylsw hidden 5bfa792 templater: make Commit object serializable
│ │  ○  - syokqt hidden 8adf0f6 templater: make CommitRef object serializable
│ │  ○  - zvtwss hidden f5424fa merge: derive serde::Serialize for RefTarget
│ │
│ │  Changed remote bookmarks:
│ │  gh-readonly-queue/main/pr-6835-3d3159ecd19fd5467d266235180729fd8364b1db@upstream:
│ │  + untracked (absent)
│ │  - untracked tqylsw hidden 5bfa792 templater: make Commit object serializable
│ │  gh-readonly-queue/main/pr-6836-5bfa79235d513cd885fb7b4d9f2a597e3e5b9a42@upstream:
│ │  + untracked (absent)
│ │  - untracked wyskno hidden a864eec templater: make Operation object serializable
│ ○  0c48c0f4eb8c [email protected] 4 minutes ago, lasted 2 seconds
├─╯  fetch from git remote(s) upstream
│    args: jj git fetch --remote upstream --branch main --branch master --branch 'glob:gh-readonly-queue*' --branch 'glob:ig/*'
│
│    Changed commits:
│    ○  + ysoorm 981d0b7 main*| revset: inline expect_<construct>_with() parsing helpers
│    ○  + yvnusy 1f91541 revset: extract expect_<construct>() parsing helpers
│    ○  + umxltz cb5b171 revset: remove unused diagnostics argument from expect_literal()
│    ○  + vsnmsu fda95e5 revset: reimplement alias stack helper without using recursion
│    ○  + ouyuyx d620529 revset: normalize and simplify type error messages
│    ○  + pwrrqm f2b024c revset: fix alias stack of pattern/literal parse error
│
│    Changed local bookmarks:
│    main:
│    + ysoorm 981d0b7 main*| revset: inline expect_<construct>_with() parsing helpers
│    - uxlpvy 3d3159e templater: make some more value types serializable
│
│    Changed remote bookmarks:
│    gh-readonly-queue/main/pr-6835-3d3159ecd19fd5467d266235180729fd8364b1db@upstream:
│    + untracked (absent)
│    - untracked tqylsw 5bfa792 templater: make Commit object serializable
│    gh-readonly-queue/main/pr-6836-5bfa79235d513cd885fb7b4d9f2a597e3e5b9a42@upstream:
│    + untracked (absent)
│    - untracked wyskno a864eec templater: make Operation object serializable
│    main@upstream:
│    + tracked ysoorm 981d0b7 main*| revset: inline expect_<construct>_with() parsing helpers
│    - tracked uxlpvy 3d3159e templater: make some more value types serializable
○  3ebbe70c159a [email protected] 4 minutes ago, lasted 93 milliseconds
│  rebase commit 05b3400e015fbbeb3484e684fbfcfa05ba67a071
...

Log:

$ jj log -r 'uxlp-..main|main@upstream' --no-pager
○  ysoorm [email protected] 8 minutes ago main* 8a5e6c2
│  revset: inline expect_<construct>_with() parsing helpers
○  yvnusy [email protected] 8 minutes ago e72ebcc
│  revset: extract expect_<construct>() parsing helpers
○  umxltz [email protected] 8 minutes ago 08bc5f4
│  revset: remove unused diagnostics argument from expect_literal()
○  vsnmsu [email protected] 8 minutes ago 4de5791
│  revset: reimplement alias stack helper without using recursion
○  ouyuyx [email protected] 8 minutes ago 158f14a
│  revset: normalize and simplify type error messages
○  pwrrqm [email protected] 8 minutes ago be1d65a
│  revset: fix alias stack of pattern/literal parse error
│ ◆  ysoorm hidden [email protected] 28 minutes ago main@git main@upstream 981d0b7
│ │  revset: inline expect_<construct>_with() parsing helpers
│ ~  (elided revisions)
├─╯
◆  uxlpvy [email protected] 1 hour ago 3d3159e
│  templater: make some more value types serializable
~

Separate question

Update: This is solved in https://github.com/jj-vcs/jj/issues/6830#issuecomment-3011152939, the precedence of operations was not what I thought. 🤦

I find two things about the jj log behavior also weird:

  • In the second one, why were the hidden revisions in between uxlp and main@git not shown?

  • In this one below (which was previously above, but no longer), I made the mistake of using uxlp where I should have had uxlp-, and yet I still got output.

$ jj log --at-op 3ebbe -r 'uxlp..main|main@upstream' --no-pager
◆  uxlpvy [email protected] 1 hour ago main* 3d3159e
│  templater: make some more value types serializable
~

Update 2: In case it matters, my version of jj is roughly at commit 6323764 from 3 days ago.

ilyagr avatar Jun 27 '25 02:06 ilyagr

Background snapshotting processes like VisualJJ or watchman.register-snapshot-trigger inevitably have this problem. To mitigate this class of problems, I think we'll need a wider lock to serialize local operations. This lock will cover the whole snapshot->update_repo->publish_op scope. Since this lock doesn't affect the correctness (= divergence is valid outcome if you ran two jj commands concurrently), it's probably good to ignore/override stale lock file left by crashed jj process.

I proposed a simple solution above, It would probably look something like this:

enum ResolveStrategy {
  DISCARD,
  MERGE,
}
struct Operation {
  ...
  strategy: ResolveStrategy
}

ResolveConcurrentChanges(op1, op2) {
  if (op1.strategy == Discard) {
    abandon(op1);
  } else if (op2.strategy == Discard) {
    abandon(op2);
  } else {
    merge(op1, op2);
  }
}

Is there a reason we couldn't take an approach like this instead of widening the lock? I'd be concerned that widening the lock could slow down a foreground process of jj if a background one is running already.

matts1 avatar Jun 27 '25 02:06 matts1

$ jj op log -d
@    c2a9f7c31124 [email protected] 4 minutes ago, lasted 21 milliseconds
├─╮  reconcile divergent operations
│ │  args: jj log -r oldigtrunk..igtrunk
○ │  efe5291073f7 [email protected] 4 minutes ago, lasted 8 milliseconds
│ │  import git refs

If you ran jj while background jj/git process is updating Git refs, the concurrent jj process might import intermediate state. I don't know if this was the case, though.

$ jj log -r 'uxlp-..main|main@upstream' --no-pager

Maybe you mean uxlp-..(main|main@upstream)?


I proposed a simple solution above, It would probably look something like this:

enum ResolveStrategy {
  DISCARD,
  MERGE,
}
struct Operation {
  ...
  strategy: ResolveStrategy
}

ResolveConcurrentChanges(op1, op2) {
  if (op1.strategy == Discard) {
    abandon(op1);
  } else if (op2.strategy == Discard) {
    abandon(op2);
  } else {
    merge(op1, op2);
  }
}

If we add metadata to operation, not to operation heads, maybe we'll need to scan all operations down to the common ancestor operation, and discard intermediate operations as needed?

Is there a reason we couldn't take an approach like this instead of widening the lock? I'd be concerned that widening the lock could slow down a foreground process of jj if a background one is running already.

There are other related problems (such as racy git imports/exports) that can be addressed by wider lock.

yuja avatar Jun 27 '25 02:06 yuja

$ jj log -r 'uxlp-..main|main@upstream' --no-pager

Maybe you mean uxlp-..(main|main@upstream)?

Indeed, in both cases, thanks.

ilyagr avatar Jun 27 '25 02:06 ilyagr

Do we have a way forward on this or a workaround? This occasionally makes jj nigh unusable on a big repo - especially if I'm trying to clean up a diff or history.

  1. Make a change
  2. Workspace stale - cannot do anything.
  3. Fix / delete lock/ abandon divergent etc
  4. jj status
  5. Goto 1

stephenprater avatar Oct 16 '25 16:10 stephenprater

I've considered adjusting the code for reconciling concurrent operations to look for concurrent snapshot operations and keeping only the newest one.

martinvonz avatar Oct 16 '25 16:10 martinvonz

I've considered adjusting the code for reconciling concurrent operations to look for concurrent snapshot operations and keeping only the newest one.

Disclaimer, I don't know the rust code base at all so this idea might be dumb, but seeing all the discussions focusing on the oplog and changes to it, I wonder:

Would it be possible for watchman to simply queue up changes and wait until a previously triggered snapshot operation has completed before requesting the next one? AFAIU, this would avoid concurrent oplog changes (unless the user runs additional jj commands) triggered by running watchman without needing to change jj core logic.

tim-janik avatar Oct 16 '25 19:10 tim-janik

Would it be possible for watchman to simply queue up changes and wait until a previously triggered snapshot operation has completed before requesting the next one? AFAIU, this would avoid concurrent oplog changes (unless the user runs additional jj commands) triggered by running watchman without needing to change jj core logic.

This isn't a problem specific to watchman. I think @mlcui-corp ran into this issue as well without using watchman in a different context, I don't remember exactly what he did to trigger it though.

matts1 avatar Oct 17 '25 00:10 matts1

Same problem here. I've just started using Jujutsu on my local Firefox repo, and I frequently need to do bisects. Sadly, my auto-bisect session often gets interrupted by Internal error: Failed to check out commit [...] Caused by: Concurrent checkout. Would be delighted to see a workaround or solution.

CSharperMantle avatar Oct 23 '25 05:10 CSharperMantle

IME "Concurrent checkout" is quite easy to trigger even without watchman. I keep one terminal window permanently running watch jj log (like @martinvonz does in one of his demos). Running something like while cargo c && jj next; end in another window produces empty commits everywhere, stale checkouts, etc. - and you can't even jj undo your way out of it because the oplog is no longer linear.

Anyway, I fixed my issues by aliasing jj to flock .jj/lock jj.

asayers avatar Oct 27 '25 03:10 asayers

I think we should consider holding a lock across the whole command if we're going to snapshot the working copy (i.e. unless --ignore-working-copy or --at-op was passed). We don't currently do that because - unlike most VCSs - we don't need to do it for correctness. We can still it if it's going to result in better UX, and it seems to me like it would.

martinvonz avatar Oct 29 '25 05:10 martinvonz

Using a lockfile as a "best-effort" measure for reducing divergent ops sounds like a good plan to me. My experience of doing this is mostly positive. I haven't had to clean up my tree or run workspace update-stale since making this alias. The only place I've wished for a more fine-grained lock is when running bisect. It keeps the lock for a long time and you can't run log to see what's going on. Not the end of the world though.

asayers avatar Oct 29 '25 07:10 asayers

We might be able to get away with a smaller lock than whole operation. it looks like most of the divergence is coming from a race on Git HEAD - attempted to fix #7895

sheremetyev avatar Oct 29 '25 10:10 sheremetyev

My concern with a lock-based approach to snapshotting is that of performance. Since watchman will trigger an snapshot every time a file is written, which for an IDE such as vscode, can be as frequent as every second or so, this is potentially a large concern to me as someone working on a very large repo.

I've considered adjusting the code for reconciling concurrent operations to look for concurrent snapshot operations and keeping only the newest one.

Was there a reason why this turned out to not be viable?

matts1 avatar Oct 29 '25 12:10 matts1

Workspace snapshotting is already protected by a lock, #7895 just includes interaction with Git HEAD under the same lock.

Regarding performance concerns - probably increasing settle parameter in Watchman config should help?

sheremetyev avatar Oct 29 '25 13:10 sheremetyev

IME "Concurrent checkout" is quite easy to trigger even without watchman. I keep one terminal window permanently running watch jj log (like @martinvonz does in one of his demos). Running something like while cargo c && jj next; end in another window produces empty commits everywhere, stale checkouts, etc. - and you can't even jj undo your way out of it because the oplog is no longer linear.

Anyway, I fixed my issues by aliasing jj to flock .jj/lock jj.

The only place I've wished for a more fine-grained lock is when running bisect. It keeps the lock for a long time and you can't run log to see what's going on.

You should be able to just run watch jj log --ignore-working-copy without locking. This will not create new snapshots or op log entries for you (e.g. if you modified a file via command line that will not be inspected), but it will keep showing you updates about the effects of other jj operations you have run so far, and works also during bisect.

tim-janik avatar Nov 03 '25 11:11 tim-janik

That's a good tip, thanks!

asayers avatar Nov 03 '25 12:11 asayers

@matts1, the fix for Git HEAD landed - would be great if you could install jj from latest main and see if the original problem with watchman still reproduces. Thanks!

sheremetyev avatar Nov 25 '25 14:11 sheremetyev

I'm not familiar enough with your PR to know whether it was the cause of it, but my jj was hanging, and I managed to fix it by deleting both .jj/working_copy/working_copy.lock and .jj/repo/git_import_export.lock. Is there any chance that hitting control-C while it's snapshotting could leave the lock there so future snapshots can't succeed?

matts1 avatar Nov 27 '25 02:11 matts1

Is there any chance that hitting control-C while it's snapshotting could leave the lock there

On Windows, yes. On Unix, a lock file with no living processes is supposed to be ignored.

yuja avatar Nov 27 '25 07:11 yuja

Thanks for testing @matts1! There was discussion of potential deadlock from the use of two locks but I'm not sure that is it yet.

Are you able to reproduce jj hanging or did it happen only once? Can you please share more details and the steps leading that hang, if you remember?

sheremetyev avatar Nov 27 '25 09:11 sheremetyev

It makes sense that it hangs after a stray lock file is left on disk due to ^C, right? That's my understanding of what happened.

martinvonz avatar Nov 27 '25 14:11 martinvonz

It makes sense that it hangs after a stray lock file is left on disk due to ^C, right? That's my understanding of what happened.

I haven't been able to reproduce that (on macOS). We are using flock on Unix, which would release the lock when all file descriptors are close on process termination.

@matts1 is this repo on a network filesystem by any chance? (Semantics of flock may be different there.)

sheremetyev avatar Nov 27 '25 16:11 sheremetyev

It's only happened once, so I'm not sure.

  • I'm running linux
  • It's not a network filesystem
  • It was on the jj repo itself, so not an especially large one.
  • I'm not sure precisely what I was doing, but it did occur while I was working on some code related to snapshotting of the working copy. I don't believe I was touching any of the locking code though.
  • I was mixing and matching using a stable jj version and this new jj version
    • Was using the old version of jj most of the time
    • Was testing out a jj I'd built
    • Visualjj was running in the background on an old version of jj

I'm not confident it's your PR causing the problem. I can try and switch to the new version of jj on a more permanent basis to try it out and see if I can reproduce it.

matts1 avatar Nov 27 '25 21:11 matts1

I was able to reproduce this problem. I'm not entirely sure what the root cause is, but:

  • This time I was using a combination of jj built from head and google's internal stable jj, which was last synced from upstream on November 15.
  • I was able to verify that removing git_import_export.lock did not solve the problem, but removing working_copy.lock after that did.

matts1 avatar Dec 03 '25 03:12 matts1

Let's close this because the divergence bug should be fixed. Feel free to open a new bug for the lock file issue.

martinvonz avatar Dec 04 '25 05:12 martinvonz