Failed to symlink, File exists
Describe the bug
I run lorri daemon in its own shell. It had been working correctly for quite some time before it crashed with this error log:
name = 'lorri'
operating_system = 'unix:Unknown'
crate_version = '1.0.0'
explanation = '''
Panic occurred in file 'src/build_loop.rs' at line 135
'''
cause = '''
Unrecoverable error:
Io {
msg: "Failed to symlink /home/user/.cache/lorri/gc_roots/43b29393751a4b4155eb2df05ec2bf91/gc_root/shell_gc_root to /nix/var/nix/gcroots/per-user/user/43b29393751a4b4155eb2df05ec2bf91-shell_gc_root: File exists (os error 17)",
}'''
method = 'Panic'
backtrace = '''
0: 0x55ace73cf8f9 - std::panicking::try::do_call::h701385363879df93
1: 0x55ace75c8b4a - __rust_maybe_catch_panic
2: 0x55ace73ed1a6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h7d93166c1161340f
3: 0x55ace75a88bf - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h3e052558cbe55fd8
4: 0x55ace75ae620 - std::sys::unix::thread::Thread::new::thread_start::h5262a6ff4b26522f
5: 0x7fe132c17ef7 - start_thread
6: 0x7fe13293722f - __GI___clone
7: 0x0 - <unknown>'''
To Reproduce
I'm not completely sure the steps that lead to the error, but I did notice before it crashed that my shell was missing a library I had specified and lorri wasn't building a new environment, so I ran touch default.nix to refresh the shell. After restarting lorri daemon it built the new environment without crashing.
Metadata
The exact version of lorri I'm running is: /nix/store/slvajmdzd3v080fn9b9kqypqdb4bgwnk-lorri-1.0/bin/lorri. I setup lorri with services.lorri.enable = true; in my configuration.nix. I'm on nixos-19.09.2470.4a0df0ce263 (I got that from the channel's manifest file)
Thanks! Here's what I've found:
Bottom-up:
- lorri attempted to create a symlink in
/nix/var/nix/gcroots, but that symlink already existed - these symlinks are created in
Roots::add, which is only called fromRoots::create_roots
Top-down:
BuildLoop::forevercallsBuildLoop::oncecallsBuidLoop::root_resultcallsRoots::create_roots
I can think of two explanations:
- There were two lorri instances that raced to build the project; one succeeded, the other failed with the error in the description. There cannot be two daemons at the same time because only one process can own the daemon socket, but there could have been a
lorri daemonand alorri watchprocess, for instance. - There was a hash collision. Unlikely.
@Skyfold, is it possible that there was a lorri daemon and a lorri watch process running at the same time here?
In terms of fixes / mitigations, I can think of the following:
- The code that creates the symlink could check if the symlink already exists and points to the right cache directory. If so, there is no need to create the symlink. If the symlink already exists but points somewhere else, that's an issue. I think lorri should probably fail with a helpful error in that case since it points to a larger problem with the cache. This would be a safe way to deal with a race.
I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).
Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?
Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.
I was only running a
lorri daemon. I checked my shell history and the last time I ranlorri watchwas before my last restart (more than a month ago).
Thanks for the clarification.
Is it possible
lorri daemonduplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one withtouch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?
#341 is about lorri duplicating builds. Crucially, lorri should never run multiple builds for the same project in parallel; #341 is about lorri enqueueing redundant builds and then running them sequentially.
Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running
lorri watchandlorri daemonat the same time.
:+1:
Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.
The symlink check overlaps with https://github.com/target/lorri/issues/387, which is about checking whether the symlink is exactly the same before overwriting it.
Which makes me wonder: we already have situations where the same symlink is overwritten multiple times, right? So how is this different?
On Wed, May 6, 2020 at 11:19 AM Leonhard Markert [email protected] wrote:
I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).
Thanks for the clarification.
Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?
#341 https://github.com/target/lorri/issues/341 is about lorri duplicating builds. Crucially, lorri should never run multiple builds for the same project in parallel; #341 https://github.com/target/lorri/issues/341 is about lorri enqueueing redundant builds and then running them sequentially.
Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.
👍
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/target/lorri/issues/389#issuecomment-624535838, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYB5ZWTL2VAMUYFC4R3DCTRQETRRANCNFSM4MYFORIA .