jj Improve cleanup of `working

As @colemickens mentioned in #83, the working_copy.lock file too easily gets left behind without getting cleaned up.

IIRC, the way Mercurial handles it is by writing the process ID to the file. Then other processes waiting for the lock can check if the process that took the lock is still running.

We may also want to reduce the risk of even leaving the file there.

Mar 02 '22 07:03 martinvonz

I keep being bitten by this since I have watch jj log and watch diff -s running in the background.

In order to reproduce such problems, a fun stress test on the jj repository is:

cd /tmp && jj git clone https://github.com/martinvonz/jj.git && cd jj # Don't do this on a repository you actually use
while true; do  jj rebase -s e0ea -d 7dbb- && sleep 0.1 &&  jj rebase -s e0ea -d 7dbb; sleep 0.1; done

This can be combined with watch -n 0.5 diff -s and/or watch -n 0.5 jj log running in other terminals.

I get two kinds of errors:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Other("Git commit '45a1675ae9796e8076b1b45f3508aea88c5ef70e' already exists with different associated non-Git meta-data")', lib/src/store.rs:117:60

and

thread 'main' panicked at 'failed to create lock file /home/ilyagr/tmp/jj/./.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25

(the latter may only happen when also running watch, not sure). Moreover, watch jj diff -s combined with the rebasing seems to lead to a state when I have to manually delete working_copy.lock. Otherwise, even without anything else running, jj log doesn't work.

Sep 12 '22 03:09 ilyagr

Anecdotally, I've run into this fairly often by invoking jj commands interactively via fzf https://gist.github.com/chooglen/c76b2a6950aade154408fad2c42f97d7

Sep 27 '22 04:09 chooglen

I didn't have to work very hard to hit this:

slynux〉/home/cole/code/_jj/nixpkgs〉jj git clone https://github.com/nixos/nixpkgs master
Fetching into new repo in "/home/cole/code/_jj/nixpkgs/master"
Working copy now at: 3e486fae5655 (no description set)

slynux〉/home/cole/code/_jj/nixpkgs/master〉jj log
^C

slynux〉/home/cole/code/_jj/nixpkgs/master〉jj log | head -n3
thread 'main' panicked at 'failed to create lock file /home/cole/code/_jj/nixpkgs/master/./.jj/working_copy/working_copy.l
ock: File exists (os error 17)', lib/src/lock.rs:56:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

edit: it's been long enough that I forgot I triggered this initial issue -_-

Oct 27 '22 15:10 colemickens

^C

Yep, jj doesn't handle SIGINT/SIGTERM yet and the process is just terminated.

Oct 28 '22 04:10 yuja

I looked at implementing that handling for a bit, but I think it would be a much better use of resources to do the PID trick or similar instead. Cleaning up lock files on signal would require a fair amount of piping shared state around to every lockfile creator, would complicate cleaning up a jj that's stuck on e.g. a broken network filesystem, and would still not solve the equivalent problem in the face of unhandleable interruptions like power loss.

Oct 28 '22 04:10 Ralith

I agree something like PID trick is really nice, but I think we'll need to handle signals anyway to shutdown pager cleanly.

I have no idea about the best practice in Rust, periodically checking atomic boolean, channel, signalfd, async?

Oct 29 '22 04:10 yuja

we'll need to handle signals anyway to shutdown pager cleanly.

Good point. @chooglen, since you're working on pager support, don't miss this.

Oct 29 '22 05:10 martinvonz

we'll need to handle signals anyway to shutdown pager cleanly.

Good point. @chooglen, since you're working on pager support, don't miss this.

It's not a strict requirement, ftr. Terminal would be messed up if the parent process killed by ctrl+c, but that shouldn't be a blocker for the initial pager support.

Oct 29 '22 05:10 yuja

Terminal would be messed up if the parent process killed by ctrl+c

I'm not sure this is true. Killing a parent process doesn't automatically kill the child process; a child may opt into receiving SIGHUP, which less probably does and then handles gracefully. ^C on a terminal is broadcast to the entire process group, per https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap11.html#tag_11_01_09, although testing indicates less seems to just ignore SIGINT anyway. ^C while the pager is open will therefore probably terminate jj and leave the pager running gracefully without any special effort.

I think we'll need to handle signals anyway to shutdown pager cleanly.

Just handling a signal is fine, and may be useful for cleaning up terminal state; piping state around everywhere to clean up lockfiles is what I want to avoid.

Oct 29 '22 05:10 Ralith

^C on a terminal is broadcast to the entire process group, per https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap11.html#tag_11_01_09, although testing indicates less seems to just ignore SIGINT anyway. ^C while the pager is open will therefore probably terminate jj and leave the pager running gracefully without any special effort.

Yeah, so after ^C, less process would be orphaned while maybe it's interacting with the tty. I think that's the problem.

Just handling a signal is fine, and may be useful for cleaning up terminal state; piping state around everywhere to clean up lockfiles is what I want to avoid.

Doing things like that in signal handler would be a mess, I agree.

Oct 29 '22 06:10 yuja

less process would be orphaned while maybe it's interacting with the tty

less itself can and does handle this case, is my point. I don't think you can get it into a bad state with anything short of SIGKILL.

Oct 29 '22 07:10 Ralith

less itself can and does handle this case, is my point. I don't think you can get it into a bad state with anything short of SIGKILL.

If less cleaned up the terminal after the shell printed prompt, the prompt would disappear. If less were still alive, the prompt would be visible, but any input would be consumed by less.

The end result isn't terribly bad, but it's not clean either.

Oct 29 '22 07:10 yuja

I tested this empirically with

fn main() {
    let mut pager = Command::new("less").stdin(Stdio::piped()).spawn().unwrap();
    let pipe = pager.stdin.take().unwrap();
    for i in 0.. {
        (&pipe).write_all(format!("foo {}\n", i).as_bytes()).unwrap();
        thread::sleep(Duration::from_millis(100));
    }
}

less does shut down automatically, but it does indeed also leave the terminal in a kind of funny state.

Oct 29 '22 17:10 Ralith

Thanks for checking. As long as we don't leave a process behind on every ^C, it seems fine to leave the signal handling for later (as long as the terminal state isn't too broken - I'm guessing reset will be enough).

Oct 29 '22 17:10 martinvonz

Yeah, nothing sticky, even hammering ^L a couple times cleans it up.

Oct 29 '22 17:10 Ralith

For dealing with SIGTERM and ^C, there is https://crates.io/crates/ctrlc and https://crates.io/crates/signal-hook for more fine control. I was trying to figure out how to get a TUI program to clean up the terminal on kill for hwatch (for now, you might want to use https://github.com/blacknon/hwatch/pull/77, in particular the --no-title option), and these seem to be the best ways, but I haven't really tried either.

Dec 13 '22 02:12 ilyagr

@ilyagr I got the same error you reported before:

thread 'main' panicked at 'failed to create lock file /Users/isinyaaa/repos/redhat/covscan/.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25

But I haven't been using watch at all. It seems that I caused it by simply having a very big file that wasn't ignored, so jj broke while trying to add it to the working copy. After I deleted the file and jj's working_copy.lock, the error disappeared.

IDK if this is related to this issue at all, though... Should I open a new one?

Feb 22 '23 15:02 antisaling

That's the same issue. The file is left behind if the process doesn't exit properly while snapshotting or updating the working copy.

Feb 22 '23 15:02 martinvonz

I can understand that a lock file can be left behind every once in a while, but does it have to take so long to detect?

$ time jj status
thread 'main' panicked at 'failed to create lock file /home/hanwen/vc/grit/.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

real	0m5.889s
user	0m0.016s
sys	0m0.014s

IIRC, the way Mercurial handles it is by writing the process ID to the file. Then other processes waiting for the lock can check if the process that took the lock is still running.

why not use operating system file locks? If the process dies, the lock is released automatically.

Feb 27 '23 22:02 hanwen

I can understand that a lock file can be left behind every once in a while, but does it have to take so long to detect?

If the file is there because another process is currently snapshotting or updating the working copy, then we want to give that process at least a few seconds to finish. If the file contained the process id, then we could notice right away if the process was no longer there.

why not use operating system file locks? If the process dies, the lock is released automatically.

That might be a good solution. Do such locks exist on all major platforms, including Windows? Do they work on all major file systems, or would we need a fallback (it may still be a good idea)?

Feb 28 '23 01:02 martinvonz

If the file is there because another process is currently snapshotting or updating the working copy, then we want to give that process at least a few seconds to finish.

If you wait X seconds, it depends on the size of the repo if X is an appropriate value. Having concurrent processes should be rare, and the user would normally be aware of this? I think the timeout could be much shorter.

That might be a good solution. Do such locks exist on all major platforms, including Windows? Do they work on all major file systems, or would we need a fallback (it may still be a good idea)?

In general they may not work on network file systems (eg. NFS), but the trick of storing PID in a file also doesn't work if a remote computer writes a lockfile.

for lock files on unix, see https://lwn.net/Articles/586904/ - I think BSD style locking is what you'd want (should be supported on OSX, which is BSD too)

I'm not very familiar with Windows, but https://en.wikipedia.org/wiki/File_locking suggests that locking files is the default on Windows.

For non-networked filesystems, the OS kernel should handle the locks (as all contenders for the lock are running on the same machine, there is no need for the filesystem to do anything.)

Feb 28 '23 11:02 hanwen

Thanks for the links.

If you wait X seconds, it depends on the size of the repo if X is an appropriate value.

Right, that's why the value is unnecessarily high in small repos :(

Having concurrent processes should be rare, and the user would normally be aware of this? I think the timeout could be much shorter.

Since there's not yet any integration by any tools (IDEs/editors, mostly), I agree that the risk is very small.

In general they may not work on network file systems (eg. NFS), but the trick of storing PID in a file also doesn't work if a remote computer writes a lockfile.

Mercurial stores the hostname to help with that, but I haven't checked how it uses it (maybe it doesn't break the lock if it was taken on another host?).

Feb 28 '23 16:02 martinvonz

Mercurial stores the hostname to help with that, but I haven't checked how it uses it (maybe it doesn't break the lock if it was taken on another host?).

Correct. On Linux, the hostname part also includes namespace pid to disambiguate.

Mar 01 '23 05:03 yuja

https://github.com/martinvonz/jj/pull/1585 should improve this dramatically for *nix targets, using BSD style locks. It sounds like this should work fine on NFS, though a filesystem shared by clients on different OSs might have fun times.

May 07 '23 02:05 Ralith

Improve cleanup of `working_copy.lock`