enabling ca-derivations can cause deadlock on "waiting for exclusive access to the Nix store for ca drvs..."
Describe the bug
After adding ca-derivations to experimental-features in nix.conf, many servers have all nix operations stuck on "waiting for exclusive access to the Nix store for ca drvs...".
It looks like this is a deadlock when upgrading the sqlite schema, but I don't know exactly when it happens. It seems to have happened on most of our servers, though, so it must be easy to trigger. Perhaps it happens when nix-daemon is restarted with the setting added while builds are in progress? In any case, everything is then permanently deadlocked at that point, so letting the existing builds complete doesn't fix it.
Aside from the deadlock, tools to inspect the sqlite version and status, and possibly upgrade it independently would be less error-prone than attempting to upgrade the schema implicitly on the fly. I wasn't expecting that merely enabling a feature would be a risky operation.
nix-env --version output
2.4.1
Additional context
This has been separately reported in https://github.com/NixOS/nixops/issues/1520, but I don't think this is a nixops-only problem.
I've worked around this by forcing the upgrade (update nix.conf, force a build of any drv, even if it fails) on machines when they're idle. I didn't get a verified reproduction, but it seems pretty easy to make it deadlock because I still got it several times when I tried without the initial drain. Perhaps if the machine is doing any build at the same time as the upgrade it will deadlock.
Still happening in 2.8.1
I also encountered this. My hunch is that it's a NixOS issue (https://github.com/NixOS/nixpkgs/issues/177142): Somehow nixos-rebuild switch seems to keep an instance of the old Nix running while it runs the new one, hence the deadlock. Not totally sure, but it seems to be the most likely culprit (esp. given that the updates seem to work fine when running outside of the script)
Aside from the deadlock, tools to inspect the sqlite version and status, and possibly upgrade it independently would be less error-prone than attempting to upgrade the schema implicitly on the fly.
There's an interesting tradeoff here. Upgrading the schema on the fly means that just using a new version might make the store backwards-incompatible. OTOH, making that more independent would make things more complex both implementation-wise (because the code would have to deal with old db versions) and for users (because they would have to think about it).
I wasn't expecting that merely enabling a feature would be a risky operation.
Fwiw it's not :) The db schema change is intentionally just adding new tables/indexes/triggers, so that the old schema is still valid (meaning that it's always possible to transparently roll back to a non-ca world)
The problem is not in the actual schema change, as far as I can tell, but in the locking. Presumably sqlite doesn't know it's a "safe" change or doesn't support atomic safe updates or whatever, or you wouldn't have needed that lock at all. But anyway, since you do need the lock, now you have deadlocks... so I'd say it definitely was risky, in that it permanently wedges up any nix operation until you back out by reverting the config change and restarting the daemon. In our case the backing out is also done via nix, so it's worse than usual when nix itself is the problem.
My first thought was that something as dangerous as a global lock plus schema upgrade should be done manually and explicitly, but I guess nix has been implicitly upgrading its schema for a long time now, and maybe the real problem is the manual-lock part. And nix is full of manual locks, so maybe it's just that there are rules about acquisition order or something that got broken for the schema-upgrade code path... which is always the thing with manual locks. Anyway, I wound up recreating a manual upgrade by hand: explicitly drain, enable ca-derivations, realize a dummy drv, then undrain... so you could say the real problem is still the lock and I was working around it with the drain. So maybe what nix needs is a less error-prone locking mechanism.
However, I do feel that crossing a backwards-incompatible line should be done explicitly, so we know about it. Otherwise it's too easy to try a canary to find bugs, find said bugs, and then be stuck because rollback doesn't work. And rollback is something nix advertises that it is good at!
It's surely more general than just nixos-rebuild, since we don't use that and still saw it. It seems to happen if you are doing any nix operation when the setting gets added, but I didn't 100% verify that, just observed it. My other thought was that it happened if GC happened to be running, but I later saw it when GC was not running, so maybe it's not just that.
@thufschmitt I doubt this comes from nixos-rebuild. We have a home-grown deployment solution that does not call nix on the remote side (apart from the nix copy receiver), and we had the problem there as well.
There are some scripts on the host where that happened that sometimes call nix run; this was probably the culprit. These timed out at some point, but by the time they were all dead, new ones had already started, keeping the deadlock in place.
One way to avoid this would be to kill off new processes rather than have them wait for the lock.
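As a rough sketch of that idea (not Nix's actual code; it uses flock(2) directly rather than Nix's lockFile helper, and the path and messages are only illustrative), a newly started process could try the lock non-blocking and exit with an error instead of queueing up behind the deadlock:
// fail-fast sketch: give up with an error instead of blocking forever
#include <sys/file.h>
#include <fcntl.h>
#include <cstdio>

int main()
{
    int fd = open("/nix/var/nix/db/big-lock", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    // LOCK_NB makes flock fail with EWOULDBLOCK instead of waiting
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        fprintf(stderr, "another process holds the big Nix store lock, giving up\n");
        return 1;
    }

    // ... would do the schema migration here ...

    flock(fd, LOCK_UN);
    return 0;
}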
I also encountered this while doing nix copy, and killing the remote nix daemon fixed it.
I also encountered this while doing nix copy
Were you copying to root@?
In our case, I copied as a non-root user and the daemon logged the connection (trusted)
Were you copying to root@?

Yes. Are we not supposed to do that? Or do we need to set NIX_REMOTE=daemon?
Were you copying to root@?

Yes. Are we not supposed to do that? Or do we need to set NIX_REMOTE=daemon?
Nope, there's nothing wrong with that in general. It's just unfortunate in that case because I assume the nix process that nix copy spawned required a more recent db schema than the daemon that was running, so it got stuck (waiting for the daemon to die, which obviously had no reason to happen).
We have a home-grown deployment solution that does not call nix on the remote side (apart from the nix copy receiver), and we had the problem there as well.
Is it calling switch-to-configuration still, or is it not NixOS-based at all?
There are some scripts on the host where that happened that sometimes call nix run; this was probably the culprit
That might be if they were building, but normally they shouldn't keep a lock on the db – these are only grabbed temporarily when needed, and because it's all nicely wrapped in an RAII interface, I have a reasonable confidence that they don't leak. But I wouldn't trust that too blindly either, so maybe they were keeping a lock after all. Were these running as root?
@thufschmitt It's calling switch-to-configuration, yes. But I'm pretty sure that doesn't launch a nix process
still an issue with nix (Nix) 2.10.3
how to break your nixos:
1. add ca-derivations to experimental-features in /etc/nixos/
# /etc/nixos/flake.nix
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/b00aa8ded743862adc8d6cd3220e91fb333b86d3"; # nixpkgs-unstable-2022-08-13
  outputs = { self, ... }@inputs: {
    nixosConfigurations.laptop1 = inputs.nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      specialArgs = {
        inherit inputs;
      };
      modules = [
        ({ pkgs, ... }: {
          # add: ca-derivations
          nix.extraOptions = ''
            experimental-features = nix-command flakes recursive-nix impure-derivations ca-derivations
          '';
        })
      ];
    };
  };
}
2. rebuild
sudo nixos-rebuild switch
3. try to rebuild again
now rebuild should hang.
with the verbose flag, you should see the message "waiting for exclusive access"
sudo nixos-rebuild switch -v
# waiting for exclusive access to the Nix store for ca drvs...
how to fix your nixos:
4. remove ca-derivations from /etc/nix/nix.conf
# on NixOS, /etc/nix/nix.conf is a symlink into the read-only store,
# so replace it with a writable copy before editing
sudo mv /etc/nix/nix.conf /etc/nix/nix.conf.broken
sudo cp /etc/nix/nix.conf.broken /etc/nix/nix.conf
sudo sed -i -E 's/(experimental-features =.*) ca-derivations/\1/' /etc/nix/nix.conf
5. revert step 1 = remove ca-derivations from experimental-features in /etc/nixos/
edit your /etc/nixos/flake.nix or /etc/nixos/configuration.nix
6. rebuild
sudo nixos-rebuild switch
I ran into this when upgrading to 2.11.0 using the command
nixos-rebuild switch --use-remote-sudo --flake github:mschwaig/nixos-config/42d1f1f21d#
I cancelled the stuck command, rebooted and just retried. That worked for me, which fits with @elaforge's theory:
Perhaps if the machine is doing any build at the same time as the upgrade it will deadlock.
I would also guess that
nixos-rebuild boot [..]
should not run into this at all unless the new Nix really interferes with itself somehow.
in the nixos-rebuild switch case:
a shared lock (lockType = ltRead) is acquired in local-store.cc:264. this succeeds
// nix/src/libstore/local-store.cc
LocalStore::LocalStore(const Params & params)
    // line 264
    /* Acquire the big fat lock in shared mode to make sure that no
       schema upgrade is in progress. */
    Path globalLockPath = dbDir + "/big-lock"; // /nix/var/nix/db/big-lock
    globalLock = openLockFile(globalLockPath.c_str(), true);
    if (!lockFile(globalLock.get(), ltRead, false)) {
        printInfo("waiting for the big Nix store lock...");
        lockFile(globalLock.get(), ltRead, true); // this succeeds
    }
an exclusive lock (lockType = ltWrite) is acquired in local-store.cc:92. this hangs
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 92
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
in theory, the exclusive lock is then downgraded back to a shared lock (lockType = ltRead) in local-store.cc:162
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 162
    lockFile(lockFd.get(), ltRead, true);
not working: workaround to unblock line 92: release the shared lock (lockType = ltRead), so the lock is unlocked (lockType = ltNone), then acquire the exclusive lock (lockType = ltWrite)
// nix/src/libstore/local-store.cc
void migrateCASchema(SQLite& db, Path schemaPath, AutoCloseFD& lockFd)
    // line 92
    printInfo("checking exclusive lock");
    if (!lockFile(lockFd.get(), ltWrite, false)) {
        printInfo("checking shared lock");
        if (lockFile(lockFd.get(), ltRead, false)) {
            printInfo("releasing shared lock ...");
            lockFile(lockFd.get(), ltNone, true);
            printInfo("releasing shared lock done");
        }
        printInfo("waiting for exclusive access to the Nix store for ca drvs...");
        lockFile(lockFd.get(), ltWrite, true); // this hangs
    }
todo: why does "acquire exclusive lock" hang? does "acquire exclusive lock" work at all?
I tried reproducing it on a fresh VM, but couldn't manage to do so (apart from manually keeping a Nix process open on the side).
Anyone encountering this, can you try running lsof /nix/var/nix/db/big-lock and see which processes are actually holding a lock (pstree or htop might be your friend)?
Just got it while trying to upgrade from 22.05 to 22.11, on a multi-user install
The lsof command returns nothing; I looked at the running processes and saw no evident lock holders.
I even logged out of the graphical interface and logged in on a tty (the only processes running were basically gdm for the login screen, bash, and systemd stuff), and it didn't work either.
My only choice was to sudo vim /etc/nix/nix.conf, remove the ca-derivations feature, force the save over the read-only file, and upgrade
But if you want to try things out, I guess I can reproduce the bug fairly easily.
In my nix dotfiles I even have a target for a CLI/GUI VM (for testing), so I'll try to reproduce it inside the VM, and you'll be able to use this for reproducibility if you want to.
But if you want to try things out, I guess I can reproduce the bug fairly easily.
If you can build a self-contained example that exhibits this, that would be truly awesome!
Just ran into this during a regular update of my desktop machine (mainly NixOS 22.11) and had to use the "edit /etc/nix/nix.conf" workaround. I had ca-derivations activated for quite a while before this (without consciously using it) without any problems, but this was pretty significant/annoying.
Ran into this problem again today.
$ sudo lsof /nix/var/nix/db/big-lock
[sudo] password for bryan:
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
Output information may be incomplete.
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
packageki 2729 root 10u REG 0,33 0 271 /nix/var/nix/db/big-lock
nix-env 60764 root 4u REG 0,33 0 271 /nix/var/nix/db/big-lock
Edit: it seems like PackageKit is holding a lock on it?
So, after killing the packagekit process, everything resumed as expected.
I am truly not an expert on this, but it looks like there's a TOCTOU (time-of-check/time-of-use) race between when Nix checks that no database migration is happening by acquiring a shared lock on big-lock:
https://github.com/NixOS/nix/blob/26c7602c390f8c511f326785b570918b2f468892/src/libstore/local-store.cc#L264-L272
And when it later acts on this check by acquiring an exclusive lock:
https://github.com/NixOS/nix/blob/26c7602c390f8c511f326785b570918b2f468892/src/libstore/local-store.cc#L92-L95
So if you start two Nix processes (or indeed two processes using the Nix store library, like PackageKit) at the same time, they can both get a shared lock on big-lock, and now they're both trying to get an exclusive lock which can't happen because they both still have the shared lock.
I guess a solution would be to try to get an exclusive lock right away, and only downgrade it to a shared lock after we've done the migration (or we've decided we don't need to do one).
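In pseudo-code terms, roughly this ordering (just a sketch using flock(2) directly rather than Nix's lockFile helper; schemaNeedsMigration and migrate are placeholders, not real Nix functions):
// sketch of the proposed lock ordering, not Nix's actual code
#include <sys/file.h>
#include <fcntl.h>

// placeholders for the real schema check and migration
static bool schemaNeedsMigration() { return false; }
static void migrate() {}

int main()
{
    int fd = open("/nix/var/nix/db/big-lock", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) return 1;

    // take the exclusive lock up front: no window where two processes both
    // hold only a shared lock and then try to upgrade
    flock(fd, LOCK_EX);

    if (schemaNeedsMigration())
        migrate();

    // downgrade to a shared lock for normal operation
    flock(fd, LOCK_SH);

    // ... open the database etc. ...
    flock(fd, LOCK_UN);
    return 0;
}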
Am I making any sense?
There are subtleties though: I was unable to get a deadlock to happen on my machine, even with a very simple C program that just calls flock twice with a sleep in between. This seems to be due to flock not upgrading the lock atomically:
Converting a lock (shared to exclusive, or vice versa) is not guaranteed to be atomic: the existing lock is first removed, and then a new lock is established. Between these two steps, a pending lock request by another process may be granted, with the result that the conversion either blocks, or fails if LOCK_NB was specified. (This is the original BSD behavior, and occurs on many other implementations.)
This lack of atomicity prevents the deadlock, which makes me wonder if there are situations where it does upgrade the lock atomically.
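For reference, a minimal version of that kind of test looks roughly like this (a sketch, not the exact program; run two copies against the same lock file at about the same time and see whether the second flock call ever blocks):
// sketch: grab a shared lock, wait, then try to upgrade it to exclusive;
// run two instances concurrently to probe flock's upgrade behaviour
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/tmp/big-lock-test", O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    flock(fd, LOCK_SH);                  // like LocalStore::LocalStore (ltRead)
    printf("got shared lock, sleeping...\n");
    sleep(5);                            // window for the other instance to
                                         // take its own shared lock
    printf("trying to upgrade to an exclusive lock...\n");
    flock(fd, LOCK_EX);                  // like migrateCASchema (ltWrite)
    printf("got exclusive lock\n");

    flock(fd, LOCK_UN);
    return 0;
}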
People who experienced this bug: what version of Linux are/were you on? What type of filesystem is your /nix/var/nix/db on?
I will still submit a PR under the assumption that this is what's happening, but it certainly is puzzling...
not working: workaround to unblock line 92: release the shared lock (lockType = ltRead), so the lock is unlocked (lockType = ltNone), then acquire the exclusive lock (lockType = ltWrite)
@milahu could you elaborate on this? Did you confirm that releasing the lock with ltNone before waiting for the exclusive lock still runs into a deadlock? Because that would invalidate my theory...
People who experienced this bug: what version of Linux are/were you on? What type of filesystem is your /nix/var/nix/db on?
NixOS (22.11 mainly), ZFS
@milahu could you elaborate on this?
I tried to reproduce this with a patched Nix 2.13, but no luck: enabling the ca-derivations feature does not cause a deadlock for me. I would have to downgrade the database schema to reach migrateCASchema.
Still happening on Nix 2.17.0. Occurred when running nixos-rebuild switch upon adding ca-derivations feature. FS is ZFS.
sudo lsof /nix/var/nix/db/big-lock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nix-daemo 392341 root 3uR REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 995083 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 995677 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 996238 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
nix-daemo 997332 root 3u REG 0,29 0 134 /nix/var/nix/db/big-lock
sudo pkill nix-daemon did not stop the processes; however, sudo pkill -KILL nix-daemon and then sudo systemctl start nix-daemon seems to…be fine? I can build CA drvs.
Doing a nixos-rebuild boot + reboot instead should also fix the deadlock.