
Beesd scans FS_NOCOW_FL flagged files and directories.

Open kotenok2000 opened this issue 5 months ago • 2 comments

I have set /var/lib/boinc/.local to nodatacow.

kotenok2000 avatar Jun 12 '25 04:06 kotenok2000

The scan halts when the referring file is opened and its FS_NOCOW_FL flag checked; files with this flag set are not scanned or deduplicated. bees closes the file immediately.

Since btrfs does not store datacow/nodatacow status in the extent tree, bees must examine extent backrefs to determine nodatacow status. This happens after new extents are identified. The generation field on nodatacow extents is not updated when the extent is modified, so each nodatacow extent should be discovered only once after it is created, even if it is later overwritten with new data.

I'll leave the issue open, because bees could fetch inode items earlier--before full path resolution--to skip heavier operations. In practice, though, these are minor compared to backref and data read costs, so the performance benefit may be negligible.

Zygo avatar Jun 12 '25 05:06 Zygo

I think the confusion may come from thinking that bees walks the directory tree, and when it encounters a nocow directory, it skips that directory and all contained files. But that's not how bees works. Bees walks the extents, and then finds the files that reference each extent (as @Zygo already explained above in more detail). If a file doesn't have the nocow flag, bees will process it.

The nocow flag also works a little differently than you might expect: if you set nocow on a directory, new files (and only new files) will inherit this flag and actually be created as nocow. Existing files can have the flag set, but that won't convert the existing data to nocow.

If you need to convert a directory including all files to nocow, do something like this:

# stop all processes accessing files in the directory
mv THATDIR THATDIR.old
mkdir THATDIR && chattr +C THATDIR
rsync -av THATDIR.old/. THATDIR/. --remove-source-files
# check success
# then remove THATDIR.old (rsync may leave empty directories)
find THATDIR.old/ -type d -print0 | xargs -0 rmdir -p

This works because rsync will create new copies of the files, thus each file inherits the nocow flag.

Note: If THATDIR is a subvolume, create a new subvolume instead of mkdir.
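
For the "check success" step, here is a minimal sketch of one way to list files that did not end up nocow. It assumes `lsattr` from e2fsprogs is installed and that file names contain no newlines. The function names `has_nocow_flag` and `list_cow_files` are made up for this example. `lsattr` prints one line per file with the flag field first, and `C` in that field is the No_COW attribute.

```shell
# Return success if an lsattr flag field contains "C" (No_COW).
# Note: lowercase "c" is the compression attribute, so the match is case-sensitive.
has_nocow_flag() {
    # $1: flag field as printed by lsattr, e.g. "---------------C------"
    case "$1" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Print every regular file under a directory that lacks the No_COW attribute.
list_cow_files() {
    # $1: directory to check (THATDIR in the steps above)
    find "$1" -type f -exec lsattr {} + 2>/dev/null |
    while read -r flags path; do
        has_nocow_flag "$flags" || printf '%s\n' "$path"
    done
}
```

Running `list_cow_files THATDIR` after the rsync should print nothing if every file inherited the flag.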

kakra avatar Jun 12 '25 09:06 kakra

Bees walks the extents, and then finds the file to that extent. If that file doesn't have the nocow flag, bees will process it.

It's a little worse than that, because all processing in bees is done by extent, not by file. So the loop looks like this:

  1. Discover a new data extent
  2. Make a list of all references (subvol, inode, ioffset, eoffset, length) to the extent
  3. For each reference:
    i. make a list of filenames for the inode
    ii. For each filename:
      a. open the file
      b. do some checks to make sure it's the right file
      c. if FS_NOCOW_FL, pretend the file doesn't exist
      d. cache the open file descriptor so we can skip step 3 next time
  4. read the data, look up hashes, maybe dedupe, etc.
  5. Repeat steps 1-4 with the next extent.

So if you have a nocow file with 1000 extents, you'll get the log message 1000 times. Worse, it's logged at INFO level, so it's mixed in with the dedupe events you might actually want to see. There's no caching at step 3.ii.c to say "don't try to open this file again." The open and file checks are not expensive, but they're not free either.
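
Until that changes, the noise can be filtered on the reading side. This is only a sketch: it assumes the INFO message contains the literal text `FS_NOCOW_FL`, which may not match the exact wording bees logs, and the function names are invented for this example.

```shell
# Assumption: the nocow INFO message contains the literal string "FS_NOCOW_FL".

# Pass through every log line from stdin except the nocow messages.
drop_nocow_messages() {
    grep -v 'FS_NOCOW_FL'
}

# Count how many nocow messages appear on stdin.
count_nocow_messages() {
    grep -c 'FS_NOCOW_FL'
}
```

Pipe whatever log source you use for bees into either function, for example `journalctl -u 'beesd@*' | drop_nocow_messages` if bees runs under the systemd unit template (adjust the unit name to your setup).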

The proper fix is to insert a check before step 3 to see if the inode is nodatacow, and in that case skip the entire extent. No log messages to worry about by the time we get to step 3.ii.
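
To make the difference concrete, here is a toy model of the loop in shell. It is not bees code. It assumes one reference per extent and one filename per inode, and the input format (`EXTENT_ID INODE NOCOW`, one extent per line) and the function name `count_rejected_opens` are made up for illustration.

```shell
# Toy model, not bees code: count how many times a nocow file would be
# opened and rejected (one INFO message each) under each strategy.
# stdin: one extent per line, "EXTENT_ID INODE NOCOW" where NOCOW is 0 or 1.
count_rejected_opens() {
    # $1: "current" (check after open) or "early-check" (check before step 3)
    local strategy=$1 extent inode nocow opens=0
    while read -r extent inode nocow; do
        if [ "$nocow" = 1 ]; then
            if [ "$strategy" = early-check ]; then
                continue            # proposed: skip the whole extent, nothing is opened
            fi
            opens=$((opens + 1))    # current: open, see FS_NOCOW_FL, log, close
        fi
    done
    echo "$opens"
}
```

Feeding it 1000 lines for the same nocow inode prints 1000 with `current` and 0 with `early-check`, which is the 1000-messages case described above.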

nocow files are becoming increasingly prevalent, and more users are reporting this, so bees should handle them better.

Zygo avatar Jul 07 '25 21:07 Zygo

But in 3.ii. it leaves the loop immediately because there's no point in finding more files with that reference? If one file is nocow, all references should be...

kakra avatar Jul 07 '25 23:07 kakra

The next extent reference at step 3 will go back into the filename loop at 3.ii again, until all the extent references are done. After that, the next extent at step 1 that happens to be part of the same file will enter the loop at step 3 again.

There is a break right after the info message about FS_NOCOW_FL, but that only stops opening more paths to the same inode (loop at 3.ii)--they'll all have the same FS_NOCOW_FL because they're the same inode, so there's no point in further checking. The purpose of that loop is to find files when they have multiple hardlinks, but some of the links have been deleted or renamed.

The loop at step 3 is a bit harder to see because it's split up into Task objects. A check could be done somewhere in BeesScanModeExtent::SizeTier::create_extent_map, where it's looking up all the references and preparing a Task to scan them. If it checks for FS_NOCOW_FL (or the equivalent BTRFS_INODE_NODATASUM[1]) as it's building refs_list, it can simply return without launching the Task to process the list. Then there are no opens at all. Something similar can be done with subvol scans to skip to the next inode if the current one is nocow.

[1] It's actually BTRFS_INODE_NODATASUM that prevents reflinks, because an inode is not allowed to have some extents with csums and some without; however, the fsattr for nodatacow controls both the SUM and COW bits.

Zygo avatar Jul 08 '25 00:07 Zygo