Aborting nix-env bricks later attempts to run it again

Open nh2 opened this issue 8 years ago • 9 comments

I'm running NixOS in a KVM virtual machine to show this, but this problem should also exist on machines with little memory, or machines that suddenly suffer a power outage.

If I have configured too little memory (e.g. running kvm without the -m flag), this can happen:

[root@nixos:~]# nix-env -i htop
installing ‘htop-2.0.2’
these paths will be fetched (0.07 MiB download, 0.19 MiB unpacked):
  /nix/store/6dbi3g4hhnpc1r3rmnmj9ivxd3hzfypv-htop-2.0.2
fetching path ‘/nix/store/6dbi3g4hhnpc1r3rmnmj9ivxd3hzfypv-htop-2.0.2’...

*** Downloading ‘https://cache.nixos.org/nar/1hccfhb7m810b0ix2afjw6j830iigbb36lchshzpl1sb2wclhmjl.nar.xz’ (signed by ‘cache.nixos.org-1’) to ‘/nix/store/6dbi3g4hhnpc1r3rmnmj9ivxd3hzfypv-htop-2.0.2’...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 72180  100 72180    0     0  50005      0  0:00:01  0:00:01 --:--:-- 50020

building path(s) ‘/nix/store/ddz54781490iywc75drap5a15nv2ggc9-user-environment’
error: unable to fork: Cannot allocate memory

Afterwards, running it again fails, even though I have now used -m 2048:

[root@nixos:~]# nix-env -i htop
installing ‘htop-2.0.2’
error: error parsing derivation ‘/nix/store/kvdgrl0mrkdz8k1yvwya2zyh70bhmkac-htop-2.0.2.drv’: expected string ‘Derive([’

I suppose that's because /nix/store/kvdgrl0mrkdz8k1yvwya2zyh70bhmkac-htop-2.0.2.drv is simply an empty dir.

It seems that a nix-env install is not atomic in this respect (the dir is created, but without contents), so a machine failure right after the mkdir can create this situation.
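To illustrate what "atomic" could mean here, below is a minimal sketch of the usual write-to-temp, fsync, rename pattern on a POSIX system. It is not how Nix writes store paths; `publish_atomically` and the `.tmp-` prefix are made-up names for the example.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data so that final_path is either absent or complete (hypothetical helper)."""
    parent = os.path.dirname(os.path.abspath(final_path))
    # Stage the content under a temporary name in the same directory,
    # so the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # content reaches disk before it gets its final name
        os.replace(tmp_path, final_path)  # atomic rename on POSIX file systems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Persist the rename itself by syncing the parent directory.
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this ordering, a crash before the rename leaves only a stray temporary file, and a crash after it leaves the complete content under the final name.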

nh2 avatar Feb 04 '17 03:02 nh2

Did you reset the VM between the first and the second run, or otherwise uncleanly shut down the VM? That's the only situation I can think of that would cause an empty file in the Nix store registered as a valid path.

It's possible to set sync-before-registering = true in nix.conf to force a sync() before valid paths are registered, but that has a pretty extreme performance impact.

edolstra avatar Feb 07 '17 10:02 edolstra

Yes, I killed the KVM with a kill signal (I found that appropriate to simulate a power outage).

I guess it would be OK to not do the sync() if we were able to detect and recover from cases like the above.

Regarding sync-before-registering (related: #966), can you elaborate a bit on what "extreme" means in this case? Maybe there are solutions to it. Is it that syncing after each download+register would head-of-line block subsequent package downloads? For example, in the ext* maintainer's post about fsync, a strategy suggested for cases like ours, with mutually independent downloads that only need to be synchronised / waited for at the very end, is to fsync() in a separate thread (and then join the threads when all the downloading is done).
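A rough sketch of that strategy, with plain file writes standing in for downloads. `fetch_all`, `fsync_path` and the worker count are names and values invented for the example; nothing here is taken from the Nix code base.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fsync_path(path: str) -> str:
    """Block until the file's data has been flushed to disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return path


def fetch_all(items: dict, dest_dir: str) -> list:
    """Write every item, fsync each one in the background, join before returning."""
    pending = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name, data in items.items():
            path = os.path.join(dest_dir, name)
            with open(path, "wb") as f:  # stands in for a download
                f.write(data)
            # Hand the flush to a worker so the next "download" can start at once.
            pending.append(pool.submit(fsync_path, path))
        # Join: result() re-raises any fsync error that occurred in a worker.
        synced = [fut.result() for fut in pending]
    return synced  # only at this point would the paths be registered as valid
```

The point of the design is that no download waits on another path's flush; the only blocking wait is the join at the end, just before registration.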

nh2 avatar Feb 07 '17 12:02 nh2

It performs a full sync() before every valid path registration. Since there can be hundreds of path registrations during Nix evaluation, this would be pretty expensive. (In fact, on heavily loaded systems, sync is not guaranteed to finish in bounded time at all.)

edolstra avatar Feb 07 '17 12:02 edolstra

IIRC I read somewhere that Ted Ts'o (the ext4 maintainer) suggested on the dpkg mailing lists to use sync_file_range() on every modified/new file, followed by fsync() on every modified/new file, to get the best performance and still be crash-safe.

Maybe another (more complex) approach would be to not do fsync/etc. at all during path registration, but have a bool fsynced field for every path in the Nix db, and have nix-daemon periodically fsync non-fsynced paths in the background. Then after a crash the first Nix operation would just need to verify the checksums of non-fsynced paths in the database.
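A toy model of this idea, using an in-memory SQLite table in place of the real Nix db. The table layout and the function names are assumptions made for the example, and the hash is simply recomputed from the file at registration time.

```python
import hashlib
import os
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE paths (path TEXT PRIMARY KEY, sha256 TEXT, fsynced INTEGER)")
    return db


def register(db: sqlite3.Connection, path: str) -> None:
    """Register a path right away, marked as not yet fsynced."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    db.execute("INSERT INTO paths VALUES (?, ?, 0)", (path, digest))


def background_sync(db: sqlite3.Connection) -> None:
    """What a periodic daemon task might do: flush each path, then set its flag."""
    rows = db.execute("SELECT path FROM paths WHERE fsynced = 0").fetchall()
    for (path,) in rows:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        db.execute("UPDATE paths SET fsynced = 1 WHERE path = ?", (path,))


def verify_after_crash(db: sqlite3.Connection) -> list:
    """Return non-fsynced paths whose contents no longer match the recorded hash."""
    corrupt = []
    rows = db.execute("SELECT path, sha256 FROM paths WHERE fsynced = 0").fetchall()
    for path, digest in rows:
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            actual = None  # a missing or unreadable path counts as corrupt
        if actual != digest:
            corrupt.append(path)
    return corrupt
```

Only paths still flagged as non-fsynced are re-hashed after a crash, which keeps the recovery cost proportional to the recent, unflushed work. The sketch ignores the durability of the database itself.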

dezgeg avatar Feb 08 '17 10:02 dezgeg

@dezgeg would you mind finding that mailing list post? I'd be very interested.

sync_file_range() works only on Linux though, so this would be an optimisation, but we'd still have to fix it for Nix running on other OSs.

Regarding your other suggestion: It does seem (as you suggest) more complex than fsyncing in separate threads during the nix-env -i, and waiting for the corresponding thread to finish before registration. It also wouldn't guarantee that everything is fully and safely done when nix-env terminates, which I think is a good guarantee to have.

nh2 avatar Feb 09 '17 12:02 nh2

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=605009 it seems. I don't know if that's actually implemented in dpkg, would have to check the source.

Yes, sync_file_range() is an optimization only (even on Linux!), so simply #ifdef'ing out the call on non-Linux is okay.
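The same guard, sketched in Python, where a runtime platform check plays the role of the #ifdef. Python's os module has no sync_file_range wrapper, so this goes through ctypes; `start_writeback` and `write_durably` are illustrative names, and the argument types assume a libc that exports sync_file_range with 64-bit offsets.

```python
import ctypes
import os
import sys

SYNC_FILE_RANGE_WRITE = 2  # flag value from the Linux headers


def start_writeback(fd: int) -> bool:
    """Ask the kernel to begin writing fd's dirty pages; never needed for safety.

    Returns True if writeback was initiated, False if the hint was skipped or failed.
    """
    if not sys.platform.startswith("linux"):
        return False  # the runtime analogue of #ifdef'ing the call out
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        fn = libc.sync_file_range
    except (OSError, AttributeError):
        return False  # libc does not expose the call; skip the hint
    fn.argtypes = [ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint]
    fn.restype = ctypes.c_int
    # offset=0 and nbytes=0 mean "from the start through the end of the file".
    return fn(fd, 0, 0, SYNC_FILE_RANGE_WRITE) == 0


def write_durably(path: str, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        start_writeback(f.fileno())  # optional hint, result deliberately ignored
        os.fsync(f.fileno())         # the call that actually provides durability
```

Because the fsync() happens regardless, dropping the hint changes only performance, not correctness.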

dezgeg avatar Feb 09 '17 13:02 dezgeg

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-store-corrupted/9947/1

nixos-discourse avatar Nov 10 '20 22:11 nixos-discourse

I marked this as stale due to inactivity.

stale[bot] avatar Jun 02 '21 16:06 stale[bot]

I have seen many cases of corrupt store paths in a production environment. These are NixOS machines that tend to have unexpected power cuts.

I have a test case that simulates a power cut with NixOS tests and reproduces the problem here: https://github.com/squalus/nix-durability-tests. It can be run on several different file systems.

nix -L build github:squalus/nix-durability-tests#corrupt-contents-tests.xfs

Possible causes:

  1. Errors from close(2) are ignored in nix::ParseSink. (From man close: Failing to check the return value when closing a file may lead to silent loss of data.)
  2. fsync(2) is not run on store path files after writing them. This means the data may not be fully flushed to disk.
  3. fsync(2) is not run on the store path directories after writing them. This means the directory could have outdated contents.

For 2 and 3, we can run sync_file_range on the files when they finish writing in nix::ParseSink. This initiates the sync operation without waiting for it to finish. Then, after the archive is finished extracting, recurse through the store path and run fsync(2) on all the files and directories. This is similar to what dpkg does. (reference: https://github.com/nixos/nixpkgs/issues/15581#issuecomment-220831808)
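A sketch of the second phase only, the recursive fsync over an extracted store path. The sync_file_range initiation step is left out because Python's standard library has no wrapper for it. `fsync_tree` is a made-up name, and this is not the code from nix::ParseSink.

```python
import os
import stat


def fsync_fd_of(path: str) -> None:
    """Open path read-only and fsync it (works for files and directories on Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_tree(root: str) -> int:
    """fsync every regular file and directory under root; return how many were synced."""
    count = 0
    # Walk bottom-up so each directory is synced after its entries.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not stat.S_ISREG(os.lstat(full).st_mode):
                continue  # symlinks and special files have no data to fsync here
            fsync_fd_of(full)
            count += 1
        fsync_fd_of(dirpath)
        count += 1
    return count
```

To make the store path's own directory entry durable, the parent of `root` would need an fsync as well, before the path is registered as valid.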

squalus avatar Sep 22 '22 19:09 squalus