lorri icon indicating copy to clipboard operation
lorri copied to clipboard

Failed to symlink, File exists

Open Skyfold opened this issue 5 years ago • 4 comments

Describe the bug

I run lorri daemon in its own shell. It had been working correctly for quite some time before it crashed with this error log:

name = 'lorri'
operating_system = 'unix:Unknown'
crate_version = '1.0.0'
explanation = '''
Panic occurred in file 'src/build_loop.rs' at line 135
'''
cause = '''
Unrecoverable error:
Io {
    msg: "Failed to symlink /home/user/.cache/lorri/gc_roots/43b29393751a4b4155eb2df05ec2bf91/gc_root/shell_gc_root to /nix/var/nix/gcroots/per-user/user/43b29393751a4b4155eb2df05ec2bf91-shell_gc_root: File exists (os error 17)",
}'''
method = 'Panic'
backtrace = '''

   0: 0x55ace73cf8f9 - std::panicking::try::do_call::h701385363879df93
   1: 0x55ace75c8b4a - __rust_maybe_catch_panic
   2: 0x55ace73ed1a6 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h7d93166c1161340f
   3: 0x55ace75a88bf - <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once::h3e052558cbe55fd8
   4: 0x55ace75ae620 - std::sys::unix::thread::Thread::new::thread_start::h5262a6ff4b26522f
   5: 0x7fe132c17ef7 - start_thread
   6: 0x7fe13293722f - __GI___clone
   7:        0x0 - <unknown>'''

To Reproduce

I'm not completely sure the steps that lead to the error, but I did notice before it crashed that my shell was missing a library I had specified and lorri wasn't building a new environment, so I ran touch default.nix to refresh the shell. After restarting lorri daemon it built the new environment without crashing.

Metadata

The exact version of lorri I'm running is: /nix/store/slvajmdzd3v080fn9b9kqypqdb4bgwnk-lorri-1.0/bin/lorri. I setup lorri with services.lorri.enable = true; in my configuration.nix. I'm on nixos-19.09.2470.4a0df0ce263 (I got that from the channel's manifest file)

Skyfold avatar May 03 '20 15:05 Skyfold

Thanks! Here's what I've found:

Bottom-up:

  • lorri attempted to create a symlink in /nix/var/nix/gcroots, but that symlink already existed
  • these symlinks are created in Roots::add, which is only called from Roots::create_roots

Top-down:

  • BuildLoop::forever calls BuildLoop::once calls BuidLoop::root_result calls Roots::create_roots

I can think of two explanations:

  • There were two lorri instances that raced to build the project; one succeeded, the other failed with the error in the description. There cannot be two daemons at the same time because only one process can own the daemon socket, but there could have been a lorri daemon and a lorri watch process, for instance.
  • There was a hash collision. Unlikely.

@Skyfold, is it possible that there was a lorri daemon and a lorri watch process running at the same time here?

In terms of fixes / mitigations, I can think of the following:

  • The code that creates the symlink could check if the symlink already exists and points to the right cache directory. If so, there is no need to create the symlink. If the symlink already exists but points somewhere else, that's an issue. I think lorri should probably fail with a helpful error in that case since it points to a larger problem with the cache. This would be a safe way to deal with a race.

curiousleo avatar May 04 '20 08:05 curiousleo

I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).

Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

Skyfold avatar May 04 '20 09:05 Skyfold

I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).

Thanks for the clarification.

Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?

#341 is about lorri duplicating builds. Crucially, lorri should never run multiple builds for the same project in parallel; #341 is about lorri enqueueing redundant builds and then running them sequentially.

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

:+1:

curiousleo avatar May 06 '20 09:05 curiousleo

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

The symlink check overlaps with https://github.com/target/lorri/issues/387, which is about checking whether the symlink is exactly the same before overwriting it.

Which makes me wonder: we already have situations where the same symlink is overwritten multiple times, right? So how is this different?

On Wed, May 6, 2020 at 11:19 AM Leonhard Markert [email protected] wrote:

I was only running a lorri daemon. I checked my shell history and the last time I ran lorri watch was before my last restart (more than a month ago).

Thanks for the clarification.

Is it possible lorri daemon duplicated the build? I'm only guessing this because I edited my cabal file, which should have triggered a build, but there was no notification that a new build had started. So, I manually triggered one with touch default.nix. Maybe lorri was already building the shell, but failed to notify or keep track of this, so it started a second build resulting in the above error?

#341 https://github.com/target/lorri/issues/341 is about lorri duplicating builds. Crucially, lorri should never run multiple builds for the same project in parallel; #341 https://github.com/target/lorri/issues/341 is about lorri enqueueing redundant builds and then running them sequentially.

Either way checking if the symlink already exists and points to the right cache directory would handle this and cases where someone was running lorri watch and lorri daemon at the same time.

👍

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/target/lorri/issues/389#issuecomment-624535838, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYB5ZWTL2VAMUYFC4R3DCTRQETRRANCNFSM4MYFORIA .

Profpatsch avatar May 06 '20 11:05 Profpatsch