enabling ca-derivations can cause deadlock on "waiting for exclusive access to the Nix store for ca drvs..."
Describe the bug
After adding ca-derivations to experimental-features in nix.conf, many servers have all nix operations stuck on "waiting for exclusive access to the Nix store for ca drvs...".
It looks like this is a deadlock when upgrading the sqlite schema, but I don't know exactly when it happens. It seems to have happened on most of our servers, though, so it must be easy to trigger. Perhaps it happens when nix-daemon is restarted with the setting added while builds are in progress? In any case, everything is then permanently deadlocked at that point, so letting the existing builds complete doesn't fix it.
Aside from the deadlock, tools to inspect the sqlite version and status, and possibly upgrade it independently would be less error-prone than attempting to upgrade the schema implicitly on the fly. I wasn't expecting that merely enabling a feature would be a risky operation.
nix-env --version output
2.4.1
Additional context
This has been separately reported in https://github.com/NixOS/nixops/issues/1520, but I don't think this is a nixops-only problem.
I've worked around this by forcing the upgrade (update nix.conf, force a build of any drv, even if it fails) on machines when they're idle. I didn't get a verified reproduction, but it seems pretty easy to make it deadlock because I still got it several times when I tried without the initial drain. Perhaps if the machine is doing any build at the same time as the upgrade it will deadlock.
Still happening in 2.8.1
I also encountered this. My hunch is that it's a NixOS issue (https://github.com/NixOS/nixpkgs/issues/177142): Somehow nixos-rebuild switch seems to keep an instance of the old Nix running while it runs the new one, hence the deadlock. Not totally sure, but it seems to be the most likely culprit (esp. given that the updates seem to work fine when running outside of the script)
Aside from the deadlock, tools to inspect the sqlite version and status, and possibly upgrade it independently would be less error-prone than attempting to upgrade the schema implicitly on the fly.
There's an interesting tradeoff here. Upgrading the schema on the fly means that just using a new version might make the store backwards-incompatible. OTOH, making that more independent would make things more complex both implementation-wise (because the code would have to deal with old db versions) and for users (because they would have to think about it).
I wasn't expecting that merely enabling a feature would be a risky operation.
Fwiw it's not :) The db schema change is intentionally just adding new tables/indexes/triggers, so that the old schema is still valid (meaning that it's always possible to transparently roll back to a non-ca world)
The problem is not in the actual schema change, as far as I can tell, but in the locking. Presumably sqlite doesn't know it's a "safe" change or doesn't support atomic safe updates or whatever, or you wouldn't have needed that lock at all. But anyway, since you do need the lock, now you have deadlocks... so I'd say it definitely was risky, in that it permanently wedges up any nix operation until you back out by reverting the config change and restarting the daemon. In our case the backing out is also done via nix, so it's worse than usual when nix itself is the problem.
My first thought was that something as dangerous as a global lock plus schema upgrade should be done manually and explicitly, but I guess nix has been implicitly upgrading its schema for a long time now, and maybe the real problem is the manual-lock part. And nix is full of manual locks, so maybe it's just that there are rules about acquisition order or something that got broken for the schema-upgrade code path... which is always the thing with manual locks. Anyway, I wound up recreating a manual upgrade by hand: explicitly drain, enable ca-derivations, realize a dummy drv, then undrain... so you could say the real problem is still the lock and I was working around it with the drain. So maybe what nix needs is a less error-prone locking mechanism.
However, I do feel that crossing a backwards-incompatible line should be done explicitly, so we know about it. Otherwise it's too easy to try a canary to find bugs, find said bugs, and then be stuck because rollback doesn't work. And rollback is something nix advertises that it is good at!
It's surely more general than just nixos-rebuild, since we don't use that and still saw it. It seems to happen if you are doing any nix operation when the setting gets added, but I didn't 100% verify that, just observed it. My other thought was that it happened if GC happened to be running, but I later saw it when GC was not running, so maybe it's not just that.
@thufschmitt I doubt this comes from nixos-rebuild. We have a home-grown deployment solution that does not call nix on the remote side (apart from the nix copy receiver), and we had the problem there as well.
There are some scripts on the host where that happened that sometimes call nix run; this was probably the culprit. These timed out at some point, but by the time they were all dead, new ones had already started, keeping the deadlock in place.
One way to avoid this would be to kill off new processes rather than have them wait for the lock.
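As a rough sketch of that idea (not Nix's actual code; it uses flock(2) directly rather than Nix's lockFile helper, and the path and messages are only illustrative), a newly started process could try the lock non-blocking and exit with an error instead of queueing up behind the deadlock:
// fail-fast sketch: give up with an error instead of blocking forever
#include <sys/file.h>
#include <fcntl.h>
#include <cstdio>

int main()
{
    int fd = open("/nix/var/nix/db/big-lock", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    // LOCK_NB makes flock fail with EWOULDBLOCK instead of waiting
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        fprintf(stderr, "another process holds the big Nix store lock, giving up\n");
        return 1;
    }

    // ... would do the schema migration here ...

    flock(fd, LOCK_UN);
    return 0;
}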
I also encountered this while doing nix copy, and killing the remote nix daemon fixed it.
I also encountered this while doing nix copy
Were you copying to root@?
In our case, I copied as a non-root user and the daemon logged the connection (trusted)
Were you copying to root@?

Yes. Are we not supposed to do that? Or do we need to set NIX_REMOTE=daemon?
Were you copying to root@?

Yes. Are we not supposed to do that? Or do we need to set NIX_REMOTE=daemon?
Nope, there's nothing wrong with that in general. It's just unfortunate in that case because I assume the nix process that nix copy spawned required a more recent db schema than the daemon that was running, so it got stuck (waiting for the daemon to die, which obviously had no reason to happen).
We have a home-grown deployment solution that does not call nix on the remote side (apart from the nix copy receiver), and we had the problem there as well.
Is it calling switch-to-configuration still, or is it not NixOS-based at all?
There are some scripts on the host where that happened that sometimes call nix run; this was probably the culprit
That might be if they were building, but normally they shouldn't keep a lock on the db – these are only grabbed temporarily when needed, and because it's all nicely wrapped in an RAII interface, I have a reasonable confidence that they don't leak. But I wouldn't trust that too blindly either, so maybe they were keeping a lock after all. Were these running as root?
@thufschmitt It's calling switch-to-configuration, yes. But I'm pretty sure that doesn't launch a nix process
still an issue with nix (Nix) 2.10.3
how to break your nixos:
1. add ca-derivations to experimental-features in /etc/nixos/
# /etc/nixos/flake.nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/b00aa8ded743862adc8d6cd3220e91fb333b86d3"; # nixpkgs-unstable-2022-08-13
  outputs = { self, ... }@inputs: {
    nixosConfigurations.laptop1 = inputs.nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      specialArgs = {
        inherit inputs;
      };
      modules = [
        ({ pkgs, ... }: {
          # add: ca-derivations
          nix.extraOptions = ''
            experimental-features = nix-command flakes recursive-nix impure-derivations ca-derivations
          '';
        })
      ];
    };
  };
}
2. rebuild
sudo nixos-rebuild switch
3. try to rebuild again
now rebuild should hang.
with the verbose flag, you should see the message "waiting for exclusive access"
sudo nixos-rebuild switch -v
# waiting for exclusive access to the Nix store for ca drvs...
how to fix your nixos:
4. remove ca-derivations from /etc/nix/nix.conf
# on NixOS, /etc/nix/nix.conf is a symlink into the read-only store,
# so replace it with a writable copy before editing
sudo mv /etc/nix/nix.conf /etc/nix/nix.conf.broken
sudo cp /etc/nix/nix.conf.broken /etc/nix/nix.conf
sudo sed -i -E 's/(experimental-features =.*) ca-derivations/\1/' /etc/nix/nix.conf
5. revert step 1 = remove ca-derivations from experimental-features in /etc/nixos/
edit your /etc/nixos/flake.nix or /etc/nixos/configuration.nix
6. rebuild
sudo nixos-rebuild switch
I ran into this when upgrading to 2.11.0 using the command
nixos-rebuild switch --use-remote-sudo --flake github:mschwaig/nixos-config/42d1f1f21d#
I cancelled the stuck command, rebooted and just retried. That worked for me, which fits with @elaforge's theory:
Perhaps if the machine is doing any build at the same time as the upgrade it will deadlock.
I would also guess that
nixos-rebuild boot [..]
should not run into this at all unless the new Nix really interferes with itself somehow.
in the nixos-rebuild switch case:
a shared lock (lockType = ltRead) is acquired in local-store.cc:264. this succeeds
// nix/src/libstore/local-store.cc
LocalStore::LocalStore(const Params & params)
    // line 264
    /* Acquire the big fat lock in shared mode to make sure that no
       schema upgrade is in progress. */
    Path globalLockPath = dbDir + "/big-lock"; // /nix/var/nix/db/big-lock
    globalLock = openLockFile(globalLockPath.c_str(), true);
    if (!lockFile(globalLock.get(), ltRead, false)) {
        printInfo("waiting for the big Nix store lock...");
        lockFile(globalLock.get(), ltRead, true); // this succeeds
    }
an exclusive lock (lockType = ltWrite) is acquired in local-store.cc:92. this hangs
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 92
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
in theory, the exclusive lock is then downgraded back to a shared lock (lockType = ltRead) in local-store.cc:162
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 162
    lockFile(lockFd.get(), ltRead, true);
not working: workaround to unblock line 92: release the shared lock (lockType = ltRead), so the lock is unlocked (lockType = ltNone), then acquire the exclusive lock (lockType = ltWrite)
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 92
    printInfo("checking exclusive lock");
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("checking shared lock");
        if (lockFile(lockFd.get(), ltRead, false)) {
            printInfo("releasing shared lock ...");
            lockFile(lockFd.get(), ltNone, true);
            printInfo("releasing shared lock done");
        }
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
todo: why does "acquire exclusive lock" hang? does "acquire exclusive lock" work at all?
I tried reproducing it on a fresh VM, but couldn't manage to do so (apart from manually keeping a Nix process open on the side).
Anyone encountering this, can you try running lsof /nix/var/nix/db/big-lock and see which processes are actually holding a lock (pstree or htop might be your friend)?
Just got it while trying to upgrade from 22.05 to 22.11, on a multi-user install
The lsof command returns nothing; I looked at the running processes and saw no evident lock holders.
I even logged out of the graphical interface and logged in on a tty (the only processes running were basically gdm for the login screen, bash, and systemd stuff), and it didn't work either.
My only choice was to sudo vim /etc/nix/nix.conf, remove the ca-derivations feature, force the save over the read-only file, and upgrade
But if you want to try things out, I guess I can reproduce the bug fairly easily.
In my nix dotfiles I even have a target for a CLI/GUI VM (for testing), so I'll try to reproduce it inside the VM, and you'll be able to use this for reproducibility if you want to.
But if you want to try things out, I guess I can reproduce the bug fairly easily.
If you can build a self-contained example that exhibits this, that would be truly awesome!
Just ran into this during a regular update of my desktop machine (mainly NixOS 22.11) and had to use the "edit /etc/nix/nix.conf" workaround. I had ca-derivations activated for quite a while before this (without consciously using it) without any problems, but this was pretty significant/annoying.
Ran into this problem again today.
$ sudo lsof /nix/var/nix/db/big-lock
[sudo] password for bryan:
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
packageki 2729 root 10u REG 0,33 0 271 /nix/var/nix/db/big-lock
nix-env 60764 root 4u REG 0,33 0 271 /nix/var/nix/db/big-lock
Edit: it seems like PackageKit is holding a lock on it?
So, after killing the packagekit process, everything resumed as expected.
I am truly not an expert on this, but it looks like there's a TOCTOU (time-of-check/time-of-use) race between when Nix checks that no database migration is happening by acquiring a shared lock on big-lock:
https://github.com/NixOS/nix/blob/26c7602c390f8c511f326785b570918b2f468892/src/libstore/local-store.cc#L264-L272
And when it later acts on this check by acquiring an exclusive lock:
https://github.com/NixOS/nix/blob/26c7602c390f8c511f326785b570918b2f468892/src/libstore/local-store.cc#L92-L95
So if you start two Nix processes (or indeed two processes using the Nix store library, like PackageKit) at the same time, they can both get a shared lock on big-lock, and now they're both trying to get an exclusive lock which can't happen because they both still have the shared lock.
I guess a solution would be to try to get an exclusive lock right away, and only downgrade it to a shared lock after we've done the migration (or we've decided we don't need to do one).
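In pseudo-code terms, roughly this ordering (just a sketch using flock(2) directly rather than Nix's lockFile helper; schemaNeedsMigration and migrate are placeholders, not real Nix functions):
// sketch of the proposed lock ordering, not Nix's actual code
#include <sys/file.h>
#include <fcntl.h>

// placeholders for the real schema check and migration
static bool schemaNeedsMigration() { return false; }
static void migrate() {}

int main()
{
    int fd = open("/nix/var/nix/db/big-lock", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) return 1;

    // take the exclusive lock up front: no window where two processes both
    // hold only a shared lock and then try to upgrade
    flock(fd, LOCK_EX);

    if (schemaNeedsMigration())
        migrate();

    // downgrade to a shared lock for normal operation
    flock(fd, LOCK_SH);

    // ... open the database etc. ...
    flock(fd, LOCK_UN);
    return 0;
}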
Am I making any sense?
There are subtleties though: I was unable to get a deadlock to happen on my machine, even with a very simple C program that just calls flock twice with a sleep in between. This seems to be due to flock not upgrading the lock atomically:
Converting a lock (shared to exclusive, or vice versa) is not guaranteed to be atomic: the existing lock is first removed, and then a new lock is established. Between these two steps, a pending lock request by another process may be granted, with the result that the conversion either blocks, or fails if LOCK_NB was specified. (This is the original BSD behavior, and occurs on many other implementations.)
This lack of atomicity prevents the deadlock, which makes me wonder if there are situations where it does upgrade the lock atomically.
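For reference, a minimal version of that kind of test looks roughly like this (a sketch, not the exact program; run two copies against the same lock file at about the same time and see whether the second flock call ever blocks):
// sketch: grab a shared lock, wait, then try to upgrade it to exclusive;
// run two instances concurrently to probe flock's upgrade behaviour
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/tmp/big-lock-test", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    flock(fd, LOCK_SH);                  // like LocalStore::LocalStore (ltRead)
    printf("got shared lock, sleeping...\n");
    sleep(5);                            // window for the other instance to
                                         // take its own shared lock
    printf("trying to upgrade to an exclusive lock...\n");
    flock(fd, LOCK_EX);                  // like migrateCASchema (ltWrite)
    printf("got exclusive lock\n");

    flock(fd, LOCK_UN);
    return 0;
}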
People who experienced this bug: what version of Linux are/were you on? What type of filesystem is your /nix/var/nix/db on?
I will still submit a PR under the assumption that this is what's happening, but it certainly is puzzling...
not working: workaround to unblock line 92: release the shared lock (lockType = ltRead), so the lock is unlocked (lockType = ltNone), then acquire the exclusive lock (lockType = ltWrite)
@milahu could you elaborate on this? Did you confirm that releasing the lock with ltNone before waiting for the exclusive lock still runs into a deadlock? Because that would invalidate my theory...
People who experienced this bug: what version of Linux are/were you on? What type of filesystem is your /nix/var/nix/db on?
NixOS (22.11 mainly), ZFS
@milahu could you elaborate on this?
I tried to reproduce this with a patched Nix 2.13, but no luck: enabling the ca-derivations feature does not cause a deadlock for me. I would have to downgrade the database schema to reach migrateCASchema.
Still happening on Nix 2.17.0. Occurred when running nixos-rebuild switch upon adding ca-derivations feature. FS is ZFS.
sudo lsof /nix/var/nix/db/big-lock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nix-daemo 392341 root 3uR REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 995083 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 995677 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 996238 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 997332 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
sudo pkill nix-daemon did not stop the processes; however, sudo pkill -KILL nix-daemon and then sudo systemctl start nix-daemon seems to…be fine? I can build CA drvs.
Doing a nixos-rebuild boot + reboot instead should also fix the deadlock.