bees icon indicating copy to clipboard operation
bees copied to clipboard

Endless loop when first started with empty filesystem (0.6.1)

Open charles-dyfis-net opened this issue 6 years ago • 5 comments

This was discovered when writing a NixOS system test for the bees module.

The testing logic in https://github.com/NixOS/nixpkgs/blob/923a3e4970226293e4698e44e3e5d5ccf7487603/nixos/tests/bees.nix consistently succeeds every time: This code first creates files on a new filesystem, then starts the bees service. That (passing) test is roughly equivalent to the following shell script:

any_shared_space() {
  [[ $(btrfs fi du -s --raw "$@" | awk 'NR>1 { print $3 }' | grep -E '^0$' | wc -l) -eq 0 ]]
}
die() { echo "$*" >&2; exit 1; }

mkfs.btrfs -f -L aux /dev/vdb || die
mount /dev/vdb /home || die
dd if=/dev/urandom of=/home/dedup-me-1 bs=1M count=8 || die
cp --reflink=never /home/dedup-me-1 /home/dedup-me-2 || die
any_shared_space /home/dedup-me-* && die "ERROR: Detecting shared space before any deduplication has been done"
sync
systemctl start [email protected]
sleep 10
any_shared_space /home/dedup-me-* || die "ERROR: No shared space detected even after bees is running"

By contrast, a test akin to the following -- which starts the service after the filesystem is created and initially mounted, but before any content has been created -- consistently fails, with bees running in a loop which is trying to poll the status of a file descriptor referring to a file that doesn't exist:

any_shared_space() {
  [[ $(btrfs fi du -s --raw "$@" | awk 'NR>1 { print $3 }' | grep -E '^0$' | wc -l) -eq 0 ]]
}
die() { echo "$*" >&2; exit 1; }

mkfs.btrfs -f -L aux /dev/vdb || die
mount /dev/vdb /home || die
systemctl start [email protected]
sleep 1
dd if=/dev/urandom of=/home/dedup-me-1 bs=1M count=8 || die
cp --reflink=never /home/dedup-me-1 /home/dedup-me-2 || die
sync
sleep 10
any_shared_space /home/dedup-me-* || die "ERROR: No shared space detected even after bees is running"

The actual failing test can be found at https://github.com/charles-dyfis-net/nixpkgs/blob/bees-test-failing/nixos/tests/bees.nix; if checking out the relevant branch of nixpkgs, it can be run (from the root of that working tree) with nix-build -I nixpkgs="$PWD" ./nixos/tests/bees.nix.

strace of the loop taking place when in that failed state can be seen at https://gist.github.com/charles-dyfis-net/34ac2e4d2bada0c3a3c8632cab98c8d9

charles-dyfis-net avatar Nov 27 '18 04:11 charles-dyfis-net

The title is a bit misleading, as the filesystem is not quite empty even in the failing case -- we've still had a .beeshome subvol created by the wrapper, after all.

charles-dyfis-net avatar Nov 27 '18 04:11 charles-dyfis-net

Do you have the log output from the failing case? Among other things it will have the context leading up to the exception.

Zygo avatar Nov 27 '18 05:11 Zygo

Edited an additional file log.txt into the previously-linked gist.

charles-dyfis-net avatar Nov 27 '18 11:11 charles-dyfis-net

So it's failing to find the extent tree in the root tree. Fascinating.

Maybe there's some race condition during the first btrfs mount where transid 1 or the extent tree doesn't exist yet?

In any case, some other things are going wrong after that: somehow we get a transid_max of 18446744073709551615 (maybe need something like 90132182fdb2c06a2811ce9d45c54046adaf536d for transid_max()). After that, nothing works because we're assuming the largest possible transaction rate and won't accept any smaller transid than the maximum possible transid.

It should be possible to work around this by forcing a sync after creating .beeshome.

Zygo avatar Nov 27 '18 18:11 Zygo

Confirmed: Adding that sync (as in charles-dyfis-net/nixpkgs@5ed17dfb01e82db9f78c2f78a694c321ba07c697) does indeed work around the immediate issue.

charles-dyfis-net avatar Nov 27 '18 18:11 charles-dyfis-net