Improve cleanup of `working_copy.lock`
As @colemickens mentioned in #83, the working_copy.lock file too easily gets left behind without getting cleaned up.
IIRC, the way Mercurial handles it is by writing the process ID to the file. Then other processes waiting for the lock can check if the process that took the lock is still running.
We may also want to reduce the risk of even leaving the file there.
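For illustration, here is a rough sketch of what the PID trick could look like. This is not a proposal for the actual implementation: the liveness check below is Linux-specific (`/proc/<pid>`) and racy around PID reuse, it's just to show the shape of the idea.

```rust
use std::fs::{self, OpenOptions};
use std::io::{ErrorKind, Write};
use std::path::Path;

fn try_lock(path: &Path) -> std::io::Result<bool> {
    match OpenOptions::new().write(true).create_new(true).open(path) {
        Ok(mut file) => {
            // We created the lock file; record our PID so waiters can check on us.
            writeln!(file, "{}", std::process::id())?;
            Ok(true)
        }
        Err(err) if err.kind() == ErrorKind::AlreadyExists => {
            // Someone else holds the lock; see whether that process is still alive.
            let pid: u32 = fs::read_to_string(path)?.trim().parse().unwrap_or(0);
            if pid == 0 || !Path::new(&format!("/proc/{}", pid)).exists() {
                // Stale lock left behind by a dead process: remove it and retry.
                fs::remove_file(path)?;
                try_lock(path)
            } else {
                Ok(false)
            }
        }
        Err(err) => Err(err),
    }
}

fn main() -> std::io::Result<()> {
    let locked = try_lock(Path::new("/tmp/working_copy.lock"))?;
    println!("acquired: {}", locked);
    Ok(())
}
```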
I keep being bitten by this since I have `watch jj log` and `watch jj diff -s` running in the background.
In order to reproduce such problems, a fun stress test on the jj repository is:
cd /tmp && jj git clone https://github.com/martinvonz/jj.git && cd jj # Don't do this on a repository you actually use
while true; do jj rebase -s e0ea -d 7dbb- && sleep 0.1 && jj rebase -s e0ea -d 7dbb; sleep 0.1; done
This can be combined with `watch -n 0.5 jj diff -s` and/or `watch -n 0.5 jj log` running in other terminals.
I get two kinds of errors:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Other("Git commit '45a1675ae9796e8076b1b45f3508aea88c5ef70e' already exists with different associated non-Git meta-data")', lib/src/store.rs:117:60
and
thread 'main' panicked at 'failed to create lock file /home/ilyagr/tmp/jj/./.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25
(The latter may only happen when also running watch; I'm not sure.) Moreover, `watch jj diff -s` combined with the rebasing seems to lead to a state where I have to manually delete working_copy.lock. Otherwise, even without anything else running, `jj log` doesn't work.
Anecdotally, I've run into this fairly often by invoking jj commands interactively via fzf (https://gist.github.com/chooglen/c76b2a6950aade154408fad2c42f97d7).
I didn't have to work very hard to hit this:
slynux〉/home/cole/code/_jj/nixpkgs〉jj git clone https://github.com/nixos/nixpkgs master
Fetching into new repo in "/home/cole/code/_jj/nixpkgs/master"
Working copy now at: 3e486fae5655 (no description set)
slynux〉/home/cole/code/_jj/nixpkgs/master〉jj log
^C
slynux〉/home/cole/code/_jj/nixpkgs/master〉jj log | head -n3
thread 'main' panicked at 'failed to create lock file /home/cole/code/_jj/nixpkgs/master/./.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
edit: it's been long enough that I forgot I triggered this initial issue -_-
^C
Yep, jj doesn't handle SIGINT/SIGTERM yet and the process is just terminated.
I looked at implementing that handling for a bit, but I think it would be a much better use of resources to do the PID trick or similar instead. Cleaning up lock files on signal would require a fair amount of piping shared state around to every lockfile creator, would complicate cleaning up a jj that's stuck on e.g. a broken network filesystem, and would still not solve the equivalent problem in the face of unhandleable interruptions like power loss.
I agree something like the PID trick is really nice, but I think we'll need to handle signals anyway to shut down the pager cleanly.
I have no idea what the best practice is in Rust: periodically checking an atomic boolean, a channel, signalfd, async?
we'll need to handle signals anyway to shut down the pager cleanly.
Good point. @chooglen, since you're working on pager support, don't miss this.
It's not a strict requirement, ftr. The terminal would be messed up if the parent process is killed by ctrl+c, but that shouldn't be a blocker for the initial pager support.
The terminal would be messed up if the parent process is killed by ctrl+c
I'm not sure this is true. Killing a parent process doesn't automatically kill the child process; a child may opt into receiving SIGHUP, which less probably does and then handles gracefully. ^C on a terminal is broadcast to the entire process group, per https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap11.html#tag_11_01_09, although testing indicates less seems to just ignore SIGINT anyway. ^C while the pager is open will therefore probably terminate jj and leave the pager running gracefully without any special effort.
I think we'll need to handle signals anyway to shut down the pager cleanly.
Just handling a signal is fine, and may be useful for cleaning up terminal state; piping state around everywhere to clean up lockfiles is what I want to avoid.
^C on a terminal is broadcast to the entire process group, per https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap11.html#tag_11_01_09, although testing indicates `less` seems to just ignore SIGINT anyway. ^C while the pager is open will therefore probably terminate jj and leave the pager running gracefully without any special effort.
Yeah, so after ^C, the less process would be orphaned while it might still be interacting with the tty. I think that's the problem.
Just handling a signal is fine, and may be useful for cleaning up terminal state; piping state around everywhere to clean up lockfiles is what I want to avoid.
Doing things like that in a signal handler would be a mess, I agree.
the less process would be orphaned while it might still be interacting with the tty
less itself can and does handle this case, is my point. I don't think you can get it into a bad state with anything short of SIGKILL.
less itself can and does handle this case, is my point. I don't think you can get it into a bad state with anything short of SIGKILL.
If less cleaned up the terminal after the shell printed the prompt, the prompt would disappear.
If less were still alive, the prompt would be visible, but any input would be consumed by less.
The end result isn't terribly bad, but it's not clean either.
I tested this empirically with
use std::io::Write;
use std::process::{Command, Stdio};
use std::thread;
use std::time::Duration;

fn main() {
    // Spawn `less` as a pager and keep feeding it output until this process is killed.
    let mut pager = Command::new("less").stdin(Stdio::piped()).spawn().unwrap();
    let mut pipe = pager.stdin.take().unwrap();
    for i in 0.. {
        pipe.write_all(format!("foo {}\n", i).as_bytes()).unwrap();
        thread::sleep(Duration::from_millis(100));
    }
}
less does shut down automatically, but it does indeed also leave the terminal in a kind of funny state.
Thanks for checking. As long as we don't leave a process behind on every ^C, it seems fine to leave the signal handling for later (as long as the terminal state isn't too broken - I'm guessing reset will be enough).
Yeah, nothing sticky, even hammering ^L a couple times cleans it up.
For dealing with SIGTERM and ^C, there are https://crates.io/crates/ctrlc and, for finer-grained control, https://crates.io/crates/signal-hook. I was trying to figure out how to get a TUI program to clean up the terminal on kill for hwatch (for now, you might want to use https://github.com/blacknon/hwatch/pull/77, in particular the --no-title option), and these seem to be the best options, but I haven't really tried either.
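For what it's worth, the atomic-boolean variant with the ctrlc crate is pretty small. A rough sketch (not actual jj code; by default ctrlc only covers Ctrl-C, and the cleanup still has to happen cooperatively in the main loop):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::{thread, time::Duration};

fn main() {
    let interrupted = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&interrupted);
    // ctrlc installs a SIGINT handler that just flips the flag; the main loop
    // notices it at a safe point and can release locks / restore the terminal.
    ctrlc::set_handler(move || flag.store(true, Ordering::SeqCst))
        .expect("failed to install Ctrl-C handler");

    while !interrupted.load(Ordering::SeqCst) {
        // ... do a small unit of work (e.g. one step of a snapshot) ...
        thread::sleep(Duration::from_millis(100));
    }
    // Cooperative shutdown point: clean up here before exiting.
}
```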
@ilyagr I got the same error you reported before:
thread 'main' panicked at 'failed to create lock file /Users/isinyaaa/repos/redhat/covscan/.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25
But I haven't been using watch at all. It seems that I caused it by simply having a very big file that wasn't ignored, so jj broke while trying to add it to the working copy. After I deleted the file and jj's working_copy.lock, the error disappeared.
IDK if this is related to this issue at all, though... Should I open a new one?
That's the same issue. The file is left behind if the process doesn't exit properly while snapshotting or updating the working copy.
I can understand that a lock file can be left behind every once in a while, but does it have to take so long to detect?
$ time jj status
thread 'main' panicked at 'failed to create lock file /home/hanwen/vc/grit/.jj/working_copy/working_copy.lock: File exists (os error 17)', lib/src/lock.rs:56:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
real 0m5.889s
user 0m0.016s
sys 0m0.014s
IIRC, the way Mercurial handles it is by writing the process ID to the file. Then other processes waiting for the lock can check if the process that took the lock is still running.
why not use operating system file locks? If the process dies, the lock is released automatically.
I can understand that a lock file can be left behind every once in a while, but does it have to take so long to detect?
If the file is there because another process is currently snapshotting or updating the working copy, then we want to give that process at least a few seconds to finish. If the file contained the process id, then we could notice right away if the process was no longer there.
why not use operating system file locks? If the process dies, the lock is released automatically.
That might be a good solution. Do such locks exist on all major platforms, including Windows? Do they work on all major file systems, or would we need a fallback (it may still be a good idea)?
If the file is there because another process is currently snapshotting or updating the working copy, then we want to give that process at least a few seconds to finish.
If you wait X seconds, whether X is an appropriate value depends on the size of the repo. Having concurrent processes should be rare, and the user would normally be aware of this? I think the timeout could be much shorter.
That might be a good solution. Do such locks exist on all major platforms, including Windows? Do they work on all major file systems, or would we need a fallback (it may still be a good idea)?
In general they may not work on network file systems (eg. NFS), but the trick of storing PID in a file also doesn't work if a remote computer writes a lockfile.
For lock files on Unix, see https://lwn.net/Articles/586904/ - I think BSD-style locking is what you'd want (it should be supported on OS X, which is BSD too).
I'm not very familiar with Windows, but https://en.wikipedia.org/wiki/File_locking suggests that locking files is the default on Windows.
For non-networked filesystems, the OS kernel should handle the locks (as all contenders for the lock are running on the same machine, there is no need for the filesystem to do anything).
Thanks for the links.
If you wait X seconds, it depends on the size of the repo if X is an appropriate value.
Right, that's why the value is unnecessarily high in small repos :(
Having concurrent processes should be rare, and the user would normally be aware of this? I think the timeout could be much shorter.
Since there's not yet any integration by any tools (IDEs/editors, mostly), I agree that the risk is very small.
In general they may not work on network file systems (eg. NFS), but the trick of storing PID in a file also doesn't work if a remote computer writes a lockfile.
Mercurial stores the hostname to help with that, but I haven't checked how it uses it (maybe it doesn't break the lock if it was taken on another host?).
Mercurial stores the hostname to help with that, but I haven't checked how it uses it (maybe it doesn't break the lock if it was taken on another host?).
Correct. On Linux, the hostname part also includes the namespace PID to disambiguate.
https://github.com/martinvonz/jj/pull/1585 should improve this dramatically for *nix targets, using BSD-style locks. It sounds like this should work fine on NFS, though a filesystem shared by clients on different OSes might have fun times.
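For reference, a BSD-style lock via flock is only a few lines. Here is a rough sketch using the fs2 crate; this is just an illustration of the mechanism, not necessarily what that PR implements:

```rust
use std::fs::OpenOptions;
use fs2::FileExt; // flock()-style locking; requires the `fs2` crate

fn main() -> std::io::Result<()> {
    let file = OpenOptions::new()
        .create(true)
        .write(true)
        .open(".jj/working_copy/working_copy.lock")?;
    // Blocks until the exclusive lock is available. The kernel drops the lock
    // automatically when the file descriptor is closed, including if the
    // process crashes or is killed, so nothing is left behind.
    file.lock_exclusive()?;
    // ... snapshot / update the working copy ...
    file.unlock()?; // optional: dropping `file` releases the lock anyway
    Ok(())
}
```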