
What amount of work is redone after reboot?

Open kakra opened this issue 5 months ago • 19 comments

Issue responses and the docs say that bees redoes up to 15 minutes of work after being interrupted by a reboot, shutdown or kill.

But does it really?

I have one system that seems to be stuck in the past:

PROGRESS:
extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max 910.625G 437731     228  101099 2025-06-02 10:40   9w 1d 2025-09-24 17:38
  32M 307.927G 372675     228   63792 2025-05-20 08:32  15w 1d 2025-11-05 16:15
   8M  90.383G 147998     228  210631 2025-07-10 13:32   9w 5d 2025-09-29 00:48
   2M  25.092G 153541       0     228 2025-04-28 17:36  1y 14w 2026-11-01 14:11
 512K   9.497G 143094       0     228 2025-04-28 17:36  1y 20w 2026-12-11 21:18
 128K   1.475G 000030       0     228 2025-04-28 17:36       -                -
total   1.313T        gen_now  241971                  updated 2025-07-22 11:21

With every reboot, the remaining time only increases by weeks. The gen_min value never increases. From the logs, it looks like bees is going through the same files and snapshots again (this system uses snapper; in particular, it starts working at snapper snapshot 55, the oldest snapshot, over and over again).

Does this mean that after a restart, bees will start scanning at gen_min again? Then why is gen_max so high? Wouldn't it be better if it didn't use such huge spans of generations for gen_min and gen_max, so it had a chance to reach gen_max before shutdown and start the next cycle in time?

Or is there another value which keeps track of the progress, like objectid?

root 250 objectid 856341495808 offset 0 min_transid 228 max_transid 101099 started 1748853653 start_ts 2025-06-02-10-40-53
root 251 objectid 787283578880 offset 0 min_transid 228 max_transid 63792 started 1747722724 start_ts 2025-05-20-08-32-04
root 252 objectid 198128205824 offset 0 min_transid 228 max_transid 210631 started 1752147134 start_ts 2025-07-10-13-32-14
root 253 objectid 242523701248 offset 0 min_transid 0 max_transid 228 started 1745854593 start_ts 2025-04-28-17-36-33
root 254 objectid 216195731456 offset 0 min_transid 0 max_transid 228 started 1745854593 start_ts 2025-04-28-17-36-33
root 255 objectid 2221010944 offset 0 min_transid 0 max_transid 228 started 1745854593 start_ts 2025-04-28-17-36-33

This system was restored from backup in April, with snapper snapshots created fresh. Bees has been running ever since the first snapper snapshot.

kakra avatar Jul 24 '25 06:07 kakra

Here is a simplified btrfs. It contains only 32 blocks. Initially, it's empty:

    [________________________________] transid 0
     ^ objectid 0                   ^ objectid 31

Note: the block addresses are usually called bytenr in btrfs, but for historical reasons bees uses objectid, because the code that writes beescrawl.dat was adapted from code where objectid was the inode number. We'll use objectid here, but it's really bytenr.

When we write three blocks, we create a new btrfs transaction:

    [AAA_____________________________] transid 1
    [________________________________] transid 0
     ^ objectid 0                   ^ objectid 31

And we do that 3 more times:

    [_________DDD____________________] transid 4
    [______CCC_______________________] transid 3
    [___BBB__________________________] transid 2
    [AAA_____________________________] transid 1
    [________________________________] transid 0
     ^ objectid 0                   ^ objectid 31

If bees starts scanning at transid 4, it will scan this rectangle:

    objectid 0-31
    transid 0-4

which looks like this when we look only at the "top" transid (or when we read the filesystem normally):

    [AAABBBCCCDDD____________________] transid 0-4
     ^ objectid 0                   ^ objectid 31

beescrawl.dat records the boundaries of the scanning rectangle--the object ID range and the transid range. Note there's only one objectid recorded, because the high end of the objectid range is always 0xffffffffffffffff.
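
To make the rectangle concrete, here is a minimal sketch of a crawl position and its range check, using the same fields that appear in the beescrawl.dat lines quoted above (objectid, min_transid, max_transid). The struct and function names are illustrative; this is not bees's actual code:

    // Sketch only: one beescrawl.dat line is essentially the lower corner of the
    // rectangle (objectid, min_transid) plus the upper transid edge (max_transid);
    // the upper objectid edge is implicitly 0xffffffffffffffff.
    #include <cstdint>

    struct ScanRect {
        uint64_t objectid;     // resume point: next extent bytenr to scan
        uint64_t min_transid;  // lower transid edge (gen_min)
        uint64_t max_transid;  // upper transid edge (gen_max)
    };

    // An extent belongs to the current rectangle only if its generation falls
    // inside the transid window; extents tagged outside the window are skipped
    // without reading their data, which is why such skips are cheap.
    bool in_rectangle(const ScanRect &r, uint64_t extent_bytenr, uint64_t extent_gen)
    {
        return extent_bytenr >= r.objectid &&
               extent_gen >= r.min_transid && extent_gen <= r.max_transid;
    }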

When something is deleted, its bytenr is reused, but its transid is not. Here's what it looks like if we delete part of CCC and write new data there:

    [______EEE_______________________] transid 5 --- not scanned, out of transid range 0..4
    [_________DDD____________________] transid 4 \
    [________________________________] transid 3  \
    [___BBB__________________________] transid 2   > bees scanning this rectangle
    [AAA_____________________________] transid 1  /
    [________________________________] transid 0 /
     ^ objectid 0                   ^ objectid 31

Note that when bees gets to objectid 6, where the "CCC" used to be, it will skip over "EEE", because the "EEE" extent is tagged with transid 5, and that's out of range. This takes a very small amount of time--bees can skip tens of thousands of extents per second.

If we keep writing while bees is scanning, we have something like this:

    [________________________III_____] transid 9  \
    [_____________________HHH________] transid 8   \
    [__________________GGG___________] transid 7    > bees skips this rectangle
    [_______________FFF______________] transid 6   /
    [______EEE_______________________] transid 5  /
    [_________DDD____________________] transid 4  \
    [________________________________] transid 3   \
    [___BBB__________________________] transid 2    > bees scans this rectangle
    [AAA_____________________________] transid 1   /
    [________________________________] transid 0  /
     ^ objectid 0                   ^ objectid 31

When bees gets to "DDD" with transid 0-4, the next search operation skips to the end of the filesystem: transid 5-9 data is not in range. This does require a few seconds of IO to read and discard the extent tree pages.

bees's next scan rectangle is objectid 0-31, transid 5-9:

    [________________________III_____] transid 9  \
    [_____________________HHH________] transid 8   \
    [__________________GGG___________] transid 7    > bees scans this rectangle
    [_______________FFF______________] transid 6   /
    [______EEE_______________________] transid 5  /
    [_________DDD____________________] transid 4  \
    [________________________________] transid 3   \
    [___BBB__________________________] transid 2    > bees skips this rectangle
    [AAA_____________________________] transid 1   /
    [________________________________] transid 0  /
     ^ objectid 0                   ^ objectid 31

This search is assisted by btrfs, which can skip old metadata pages and whole branches of the extent tree. So there isn't even a delay of a few seconds here--bees skips immediately to EEE, the first extent in transid 5-9, and skips over DDD (transid 4, out of range) to FFF (transid 6, in range).
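
For reference, a rough sketch of how such a transid-bounded walk of the extent tree can be expressed with the kernel's BTRFS_IOC_TREE_SEARCH_V2 ioctl, which is what allows btrfs to skip old metadata branches. The mount path and the transid window 5-9 are made up to match the example above; this is not bees's actual code, and error handling is omitted:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/btrfs.h>
    #include <linux/btrfs_tree.h>

    int main()
    {
        int fd = open("/mnt/btrfs", O_RDONLY);  // any open fd on the filesystem

        std::vector<char> storage(sizeof(btrfs_ioctl_search_args_v2) + 64 * 1024);
        auto *args = reinterpret_cast<btrfs_ioctl_search_args_v2 *>(storage.data());
        memset(args, 0, sizeof(*args));
        args->buf_size = 64 * 1024;

        btrfs_ioctl_search_key &k = args->key;
        k.tree_id      = BTRFS_EXTENT_TREE_OBJECTID;   // walk the extent tree
        k.min_objectid = 0;                            // "objectid" = extent bytenr here
        k.max_objectid = UINT64_MAX;
        k.min_offset   = 0;
        k.max_offset   = UINT64_MAX;
        k.min_type     = k.max_type = BTRFS_EXTENT_ITEM_KEY;
        k.min_transid  = 5;                            // the crawler's transid window
        k.max_transid  = 9;
        k.nr_items     = UINT32_MAX;

        if (ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args) == 0) {
            // Metadata pages older than min_transid (or newer than max_transid)
            // are skipped inside the kernel, so the search jumps past the old
            // DDD metadata straight to the FFF neighborhood.
            const char *p = reinterpret_cast<const char *>(args->buf);
            for (uint32_t i = 0; i < k.nr_items; i++) {
                auto *h = reinterpret_cast<const btrfs_ioctl_search_header *>(p);
                printf("extent bytenr=%llu len=%llu transid=%llu\n",
                       (unsigned long long)h->objectid,
                       (unsigned long long)h->offset,
                       (unsigned long long)h->transid);
                p += sizeof(*h) + h->len;
            }
        }
        close(fd);
        return 0;
    }

In practice the search is repeated in a loop, advancing the minimum key past the last item returned, since one buffer rarely covers the whole tree.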

Once bees has caught up, the scans search only the top of the transid stack:

    [____________SSS_________________] transid 1003   \
    [_____________________VVV________] transid 1002    > bees scans this part in transid 1001-1003
    [___PPP__________________________] transid 1001   /
    [OOO___QQQRRR___TTTUUU___WWWXXXYY] transid 1000   - already scanned
    [  ... etc ...                   ] ...

Here bees scans PPP, SSS, and VVV, then goes idle to wait for more data.


With a more concrete example:

    PROGRESS:
    extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
    ----- -------- ------ ------- ------- ---------------- ------- ----------------
      max 910.625G 437731     228  101099 2025-06-02 10:40   9w 1d 2025-09-24 17:38
      32M 307.927G 372675     228   63792 2025-05-20 08:32  15w 1d 2025-11-05 16:15
       8M  90.383G 147998     228  210631 2025-07-10 13:32   9w 5d 2025-09-29 00:48
       2M  25.092G 153541       0     228 2025-04-28 17:36  1y 14w 2026-11-01 14:11
     512K   9.497G 143094       0     228 2025-04-28 17:36  1y 20w 2026-12-11 21:18
     128K   1.475G 000030       0     228 2025-04-28 17:36       -                -
    total   1.313T        gen_now  241971                  updated 2025-07-22 11:21

objectid is represented by point in this table, which normalizes the objectid. Instead of 0-31, the point runs from 000000 to 999999, regardless of the size of the filesystem, or how much unallocated space it has.

Visually, the above looks like this:

    [___________________________EEEEE] transid 241971
    [____________________DDDDDDD_____] transid 210631
    [______________CCCCCC____________] transid 101099
    [_______BBBBBBB__________________] transid 63792
    [AAAAAAA_________________________] transid 228
    [________________________________] transid 0
     ^point 0                       ^ point 999999
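
As a rough sketch of that normalization, using the point = 999999 * N / D formula that comes up again later in this thread (the function name is made up; this is not bees's actual code):

    // Sketch only: scale a crawler's position into the fixed 000000..999999 range
    // so progress lines up in the table regardless of filesystem size.
    #include <cstdint>

    uint64_t progress_point(uint64_t done, uint64_t total)
    {
        if (total == 0)
            return 0;
        // 128-bit intermediate avoids overflow on multi-TiB filesystems.
        return uint64_t((unsigned __int128)done * 999999 / total);
    }

    // e.g. a crawler roughly 43.8% of the way through its share of the data
    // reports point 437731, as in the "max" row above.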

Here is what each crawler is scanning in the current cycle:

Size 0-2M:

            vvvvvvvvvvvvvvvvvvvvvvvvv---------------- will be skipped at the end
    [AAAAAAA_________________________] transid 228
    [________________________________] transid 0

Size 2M-8M:

                                vvvvv---------------- will be skipped at the end
    [____________________DDDDDDD_____] transid 210631
    [______________CCCCCC____________] transid 101099
    [_______BBBBBBB__________________] transid 63792
     ^^^^^^^----------------------------------------- already done, not scanned again

Size 8M-32M:

                   vvvvvvvvvvvvvvvvvv----------------- will be skipped at the end
    [_______BBBBBBB__________________] transid 63792
     ^^^^^^^----------------------------------------- already done, not scanned again

Size 32M-max:

                         vvvvvvvvvvvv----------------- will be skipped at the end
    [______________CCCCCC____________] transid 101099
    [_______BBBBBBB__________________] transid 63792
     ^^^^^^^----------------------------------------- already done, not scanned again

When these crawlers get to the end of the data in their transid range, they'll take the next "slice" of the filesystem:

Sizes 0-2M next slice:

    [___________________________EEEEE] transid 241971
    [____________________DDDDDDD_____] transid 210631
    [______________CCCCCC____________] transid 101099
    [_______BBBBBBB__________________] transid 63792
     ^^^^^^^----------------------------------------- already done, not scanned again

Sizes 2M-8M next slice:

    [___________________________EEEEE] transid 241971
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^--------------------- already done, not scanned again

Sizes 8M-32M next slice:

    [___________________________EEEEE] transid 241971
    [____________________DDDDDDD_____] transid 210631
    [______________CCCCCC____________] transid 101099
     ^^^^^^^^^^^^^^---------------------------------- already done, not scanned again

Size 32M-max:

    [___________________________EEEEE] transid 241971
    [____________________DDDDDDD_____] transid 210631
     ^^^^^^^^^^^^^^^^^^^^---------------------------- already done, not scanned again

The small extent scanners (2M and below) are slow because:

  • bees prioritizes large extents over small extents if there aren't enough threads to scan all six sizes at once.

  • the small extents are slower because they are small. The 32M-max extents advance by a minimum of 32M each, so they are going to move point about 64x faster than the 512K-2M extents that might only advance by 512K at a time.

  • big files contain a lot of extents of all sizes; however, in any random collection of extents, there are numerous small ones and fewer large ones. This prevents small extents from being processed in parallel, since the btrfs dedupe operation locks both files involved in the dedupe.

  • each time a new snapshot includes old extents, it increases their reference count by one. If the extents are then found to be duplicate, it takes more time to eliminate all the reflinks. (Non-duplicate extents are read once and then all their other references are skipped). This further slows down the scanning on the oldest extents, which in this case are also the smallest.

I also note that these scans are taking a simply amazing amount of time to get done--months or years per TiB. Slow scans are fairly common on filesystems with metadata on spinning drives. Here's a LOGICAL_INO call that has been running for 12 minutes now and still isn't done:

    tid 26136: ref_1ce00638000_204K_275: [761.27s] resolving addr 0xb30bfe4000 with LOGICAL_INO

That's 12 minutes for one 204K extent, and it has 275 references. That's a scan rate of two months per TiB. Fortunately, you have less than 10 GiB in that size tier, so it won't impact overall free space very much.


Now for the "15 minutes" question.

Recall that beescrawl.dat records the scan rectangle as two points:

(minimum objectid, minimum transid)
(maximum objectid, maximum transid)

However, because multiple extents are being scanned in parallel and those extents may conflict on files, the scanning cannot always proceed in perfect order. Instead, what gets stored in beescrawl.dat (and reported in the progress table's point column) is the lowest objectid among all currently running or queued scan tasks. As a result, we can have "stragglers"--tasks lagging behind while others have progressed further.

Suppose we have three files:

[AAAACCCCEEEEGGGGIIIIKKKKMMMMOOOO] file 1
[AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHH] file 2
[JJJJLLLLNNNNPPPPQQQQRRRRSSSSHHHH] file 3

We then start these scan tasks in parallel:

  • AAAA -- locks file 1 and 2
  • BBBB -- locks only file 2
  • CCCC -- locks file 1 and 2
  • DDDD -- locks only file 2
  • ... (other extents)
  • IIII -- locks only file 1
  • ...
  • RRRR -- locks only file 3
  • SSSS -- locks only file 3
  • HHHH -- locks file 2 and 3

As these tasks acquire locks and follow the wait/die protocol, they quickly settle into a tree of dependencies--a "single-threaded queue tree." For example:

AAAA -- blocks everything
  +- BBBB, CCCC, DDDD, EEEE, FFFF, GGGG, HHHH (file 2)
  |                                        \- HHHH (file 3)
  \- JJJJ, LLLL, NNNN, PPPP, QQQQ, RRRR, SSSS (file 3)

But, when a duplicate is found at HHHH, the tree may be resorted:

AAAA -- blocks everything
  +- HHHH (file 2), BBBB, CCCC, DDDD, EEEE, FFFF, GGGG (all blocking file 2)
         \- HHHH (file 3), JJJJ, LLLL, NNNN, PPPP, QQQQ, RRRR, SSSS (all blocking file 3)

Now, HHHH will be processed before CCCC, EEEE, DDDD, etc., even though some of those are lower objectids. Also, since this tree now has a single root, bees will try to start more scan tasks, seeking extents in other files that can keep all worker threads busy. This can mean that bees continues scanning other extents (e.g. TTTT, UUUU, ... ZZZZ) in other files while we're still waiting for point to progress past AAAA and HHHH. As long as there are either unlocked extents or free memory, bees just keeps going.
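
For context, the wait/die rule itself is tiny. Here is a sketch of the decision, with a made-up "age" parameter standing in for whatever ordering bees actually applies between tasks; this is not bees's lock code:

    // Sketch of the classic wait/die rule. An older requester is allowed to wait
    // for a younger lock holder; a younger requester gives up ("dies") and retries
    // later. Because waiting is only ever allowed in one direction of the ordering,
    // no cycle of waiters can form, and the conflicting tasks settle into the
    // dependency tree shown above.
    #include <cstdint>

    enum class Action { Wait, Die };

    Action on_lock_conflict(uint64_t requester_age, uint64_t holder_age)
    {
        return requester_age < holder_age ? Action::Wait : Action::Die;
    }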

The problem: Out of all these parallel extents, only one (the lowest) can be recorded in beescrawl.dat. At first, this is reasonable--AAAA is both the lowest and the first being processed. Later, we may be processing a high extent like HHHH, while lower extents (BBBB, CCCC, DDDD, EEEE, FFFF, GGGG) are still queued, and non-conflicting extents like TTTT and ZZZZ are completed and removed from the queue.

If this behavior persists, the gap between the largest and smallest active extent can widen dramatically--sometimes for days. If bees is restarted during this time, it resumes from AAAA, even if it had already processed up to SSSS or ZZZZ before the interruption.
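
In other words, the checkpoint can only ever be the lowest unfinished position, something like this sketch (the names are illustrative, not the real ProgressTracker):

    // Sketch only: with many scan tasks in flight, the only position that is safe
    // to write to beescrawl.dat is the lowest one not yet finished--everything
    // below it is known to be done, everything above it may not be.
    #include <cstdint>
    #include <set>

    struct Checkpoint {
        std::multiset<uint64_t> in_flight;  // bytenrs of queued or running scan tasks

        void start(uint64_t bytenr)  { in_flight.insert(bytenr); }
        // Caller must only finish() positions it previously start()ed.
        void finish(uint64_t bytenr) { in_flight.erase(in_flight.find(bytenr)); }

        // Saved every 15 minutes: if AAAA is still queued behind a long dedupe,
        // the checkpoint stays at AAAA even though ZZZZ finished days ago.
        uint64_t value(uint64_t resume_default) const
        {
            return in_flight.empty() ? resume_default : *in_flight.begin();
        }
    };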

There are a few shortcuts: if BBBB through SSSS are unique, their hashes are inserted into the hash table. bees will see their entries in the hash table and skip them on restart--but only after rereading them, so the time saved is minimal.

TL;DR:

  • current progress is saved every 15 minutes, but
  • progress updates may be delayed, sometimes by days.

Zygo avatar Jul 26 '25 05:07 Zygo

The queue length is controlled in should_throttle. You can try a shorter length with a patch like this:

diff --git a/src/bees-roots.cc b/src/bees-roots.cc
index 54956f766..38e3ad5bf 100644
--- a/src/bees-roots.cc
+++ b/src/bees-roots.cc
@@ -704,7 +704,7 @@ should_throttle()
        // If there's too many entries in the queue, stop adding new ones until workers catch up
        // If there's not too many entries in the queue, restart the scan task
        const auto instance_count = Task::instance_count();
-       const auto instance_limit = BEES_MAX_EXTENT_REF_COUNT;
+       const auto instance_limit = 100;
        // Add some hysteresis so that we aren't constantly flipping throttle on and off
        const bool queue_empty = s_throttled && instance_count < instance_limit * .90;
        const bool queue_full = !s_throttled && instance_count > instance_limit * .99;

Note that if the queue is shorter, you'll get less parallelism, which means the overall scan time will be longer. A short queue ends up single-threaded for a significant portion of the time.

If you need to very frequently stop and restart bees, lowering the queue length might be the only practical way to make forward progress.

The queue must have some extra space. Don't set it below about 10 or so, or it will be throttled by the crawl tasks and never unthrottle.

Zygo avatar Jul 26 '25 05:07 Zygo

If you need to very frequently stop and restart bees, lowering the queue length might be the only practical way to make forward progress.

Yeah, I think that's exactly what's happening. I don't fully grasp all the details you outlined, but I appreciate them a lot - it's great documentation for the future. What I do get is how queue dependencies may be involved, or rather why they lead to no progress if bees doesn't run long enough.

So yes, I may try that on that particular system. I wonder if it will work with the existing crawl state, or if I need to change that to shorten the "window size" (aka instance_limit?).

I also wonder if there could be a way for bees to implement some sort of auto-tuning for that value, dynamically changing it at runtime instead of hard-coding it.

Is 100 a good starting value? I don't want to get too low. Would 1000 be too high already?

Previously (back in v0.10 or earlier days), I had a similar issue when starting bees on a system with a lot of snapshots that it couldn't catch up with. You suggested just changing some column to have the same number. That worked. But the days of that snapshot scanner are gone, and I understand that this work-around cannot work here.

But still, would I need to adjust beescrawl.dat if I adjust instance_limit, so that it takes effect immediately?

The dataset in question is an archive of DVD contents for a big German financial software suite. The archive contains yearly updated product versions, a set of two DVD disk contents per year (not an ISO but the individual files from the ISO). We keep these because sometimes updating that software asks for a package from a two-year-old version to fix some dependencies in the installer - and customers usually don't have that (because the distributor says old versions are no longer needed), or it's buried in a messy pile of other DVDs in the safe. The contents themselves have a lot of very similar files each year because the products only see delta changes. Many ZIP files of the package repository stay the same.

That is exactly what bees has been finding and deduplicating for a long time: those ZIP files, which it finds over and over again in different snapper snapshots, sometimes with different names but identical (or mostly identical) contents. The crawler state file hasn't progressed in weeks, and the extents are usually big. My concern is not with the very small extents; I know that those are slow. But even the "max" crawler doesn't seem to progress, instead it shows a huge transid range it wants to scan.

According to your description, I probably ran into a very bad scan dependency problem. But I wonder how that could even happen, because:

  1. bees could see the snapshots right from the first snapper snapshots, and it has been running all the time since then
  2. those archive files haven't changed in the last 2+ years
  3. the system has been restored from backup within the last year, starting with 0 snapper snapshots
  4. just after that observation, I cp --reflinked the archive to a dedicated subvol which snapper doesn't snap, and removed the original files
  5. step 4 didn't help because bees was still scanning and deduping within old snapshots, so I decided to defrag those cp --reflinked files to at least create bigger extents and let the old snapper snapshots fade away
  6. the archive is mirrored to a NAS using syncthing (which supports btrfs reflinks for block dedup), but I don't see any problem here because the contents have essentially never changed: no deltas have been synced, no syncthing-reflink-block-reuse...

This is all fine and I don't care much about a perfect dedup; I just want bees to make some progress and not start almost from scratch on every reboot. The reboots are usually some weeks apart (although I needed to reboot a bit more often in the last month).

kakra avatar Jul 26 '25 06:07 kakra

Meanwhile, the progress table looks like this:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max 914.835G 433229     228  101099 2025-06-02 10:40   9w 4d 2025-09-30 06:37
  32M 314.346G 370494     228   63792 2025-05-20 08:32  15w 5d 2025-11-11 18:21
   8M   86.29G 146182     228  210631 2025-07-10 13:32  11w 3d 2025-10-12 22:25
   2M  26.879G 152529       0     228 2025-04-28 17:36  1y 16w 2026-11-17 14:49
 512K   10.12G 142168       0     228 2025-04-28 17:36  1y 22w 2026-12-29 00:13
 128K    1.53G 000031       0     228 2025-04-28 17:36       -                -
total   1.322T        gen_now  247184                  updated 2025-07-24 08:37

This looks like progress going backwards (433229 vs 437731 earlier), but I can probably explain that: I restarted bees after rebuilding from the latest git to get your latest improvements.

gen_now clearly moved forward but the progress table is stuck, and will be for at least another 10 weeks and probably more.

kakra avatar Jul 26 '25 06:07 kakra

Is 100 a good starting value? I don't want to get too low. Would 1000 be too high already?

100 is a good starting point. If it makes things slow, bump to 1000. There's not much change until you add or remove zeros.

But still, would I need to adjust beescrawl.dat if I adjust instance_limit, so that it takes effect immediately?

There's no need to adjust beescrawl.dat - it records the lower bound of the queue. Instance limit affects the upper bound, which isn't recorded anywhere.

One possibility, if we get rid of the beescrawl format, is to save the entire queue (it will have at most 10,000 entries, so it can be a list of 10,000 64-bit numbers, which is a reasonable size for a file written every 15 minutes).
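
That format could be as simple as a flat array of 64-bit bytenrs. A sketch of the idea, with a hypothetical file layout (this is not an existing bees format):

    // Sketch only: checkpoint every in-flight extent bytenr, not just the lowest
    // one. ~10,000 entries * 8 bytes is ~80 KiB, cheap to rewrite every 15 minutes,
    // and on restart the whole queue can simply be re-queued.
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    void save_queue(const std::vector<uint64_t> &bytenrs, const std::string &path)
    {
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        out.write(reinterpret_cast<const char *>(bytenrs.data()),
                  bytenrs.size() * sizeof(uint64_t));
    }

    std::vector<uint64_t> load_queue(const std::string &path)
    {
        std::vector<uint64_t> bytenrs;
        std::ifstream in(path, std::ios::binary);
        uint64_t v;
        while (in.read(reinterpret_cast<char *>(&v), sizeof v))
            bytenrs.push_back(v);
        return bytenrs;
    }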

But even the "max" crawler doesn't seem to progress, instead it shows a huge transid range it wants to scan.

The max crawler has most of the data, and it looks like it's making steady (albeit slow) progress.

Remember that the amount of data is represented (loosely) by point, not transid.

transid represents time. If everything was restored very quickly, then most of the data is in those first 230 transids. The next 200000 transids represent 3 months of new data.

You could look at the detailed debug log, particularly at skip lines:

2025-07-26 02:14:12.013 11643.11671<7> ref_4e080aea2000_2.168M_2: Too many bytes copied (1.465M) for bytes deduped (720K) and holes punched (0), skipping extent
2025-07-26 02:14:12.013 11643.11671<7> ref_4e080aea2000_2.168M_2: Too many dedupe (116) and copy (116) operations, skipping extent
2025-07-26 02:14:12.015 11643.11671<6> ref_4e080aea2000_2.168M_2: skip: 2.168M 116d116c0p 0.568s {0x4e080aea2000+0x0} 0x25332c000 [+=+++++=>++++++=+++++
+++=+++=+=++++++++++++=+=>+=+=dd>+=++=>++=>+=+=d>+=+++=++++=>+=>+=+=++=++++++++++=++++++++=>+=>+++=+++=++++++++=++++++==>+++=+++++=+=dd>+++++++++=++++=+
+=>++=+++++++=>++++=>+++=+++++++=+=+=>+=+++++=++++=+=+=>++++++++++++++++=>+++=>+++=+=>++=+=++++=++++=+++=+=>+=+==dd>+++++=d>++=d>+++++++=++++++=++=d>+++
+++++=>+=++=+=>++++=+=+=+++++=>+=++=++=>++++=++++++=+=+++++++++++=++++++=+++=d>++++=++=>+=++=+++++=+++=++++++++=d>++=++++=++=d>+=++++++++++++++=dddd>+==
+=+=++++==>+++=++++++==+=+=dddd>++++++=+=++=++++=+=+=d>++=+=++=+=+++=++=+=+=+=] 0x253557000 /path/to/file

The limit on copy/dedupe operations is 100 per extent...maybe if you're hitting 99 all the time, it's making things too slow? Try cutting the Too many dedupe limit in half.

Zygo avatar Jul 26 '25 07:07 Zygo

I applied the patch with value 100 to the affected system. I'll let it run for 1-2 weeks, let's see if it changes anything.

Update: After some hours, it now seems to hash more files in parallel. Is that a possible side effect of the change?

kakra avatar Jul 28 '25 11:07 kakra

it now seems to hash more files in parallel. Is that a possible side effect of the change?

Maybe. It reduces the maximum amount of work done to dedupe each extent, which means more of the time is spent on other tasks like hashing or idling.

Zygo avatar Jul 28 '25 14:07 Zygo

@Zygo So after a few days with this patch and a reboot, it looks like bees catches back up to the previous "point" value much faster - essentially within minutes instead of hours. I wonder if this is an intended effect or a side effect of previous work done with the new value.

kakra avatar Aug 07 '25 09:08 kakra

Are you running a version that includes 5c0480ec594028a6ce47432bf2824a658ee3eabf (and the previous commit)? Tracking the head of the queue produces more stable estimates.

There will be some skipping of partially deduped extents, due to hash table entries saved from them--but the "point" shouldn't move backwards unless bees is killed without saving a checkpoint first. That's one of the behaviors fixed in that commit.

Zygo avatar Aug 09 '25 07:08 Zygo

Are you running a version that includes 5c0480e (and the previous commit)?

Yes.

After a reboot, "point" starts lower than where it was left at shutdown. But it then catches up much faster since I applied that "instance limit 100" patch. We should keep in mind, though, that I didn't start from the original situation of the btrfs image (I have no means of reverting it back to exactly the same state). So the faster progress may also just be a coincidence because bees passed a certain critical point.

"point" for "max size" still lacks behind the previous value before the reboot but it's close. The other lines show the same or higher "point" value now. But I can confirm that after reboot, "value" always starts lower than it was on shutdown.

It seems only that specific system does this; my other systems are working fine and follow the latest gen_max quickly. So something specific to that one btrfs seems to be very hard for bees to grind through.

I now started to run duperemove on the snapshots and data directories that seem to slow down bees. But duperemove starts to use huge huge (huge!) amounts of memory on that dataset: it sees 30 million files if including the snapshots, with 3 TB of data (the filesystem has 2 TB of allocated data overall).

My thinking was: duperemove hashes and dedupes full extents only while bees rewrites extents - so running both on that data should clean up enough to make bees faster. But duperemove isn't able to even properly checksum all of the snapshots - it can reliably only do one snapshot at a time. I've collected enough files in its database now that even reading the file list from it consumes huge amounts of memory. So it looks like despite using SQlite, it reads everything into memory first.

kakra avatar Aug 09 '25 13:08 kakra

I now started to run duperemove on the snapshots and data directories that seem to slow down bees. But duperemove starts to use huge huge (huge!) amounts of memory on that dataset: it sees 30 million files if including the snapshots, with 3 TB of data (the filesystem has 2 TB of allocated data overall).

duperemove's memory usage is proportional to the total size of references, not size of data. The ratio of memory usage to data size is large (not as big as ZFS, but orders of magnitude larger than bees).

My thinking was: duperemove hashes and dedupes full extents only while bees rewrites extents - so running both on that data should clean up enough to make bees faster.

That isn't quite what happens. duperemove relies on FIEMAP, so it often gets incorrect extent boundaries. duperemove makes a high-effort attempt to dedupe large, logically contiguous regions of data--and when that doesn't align with the true physical extent boundaries, it leaves behind unreachable blocks instead of freeing up space.

If you want zero-copy dedupe, you can set the copy limit to zero in bees (note this is untested):

diff --git a/src/bees-context.cc b/src/bees-context.cc
index 46742ad18..a6fd82fa9 100644
--- a/src/bees-context.cc
+++ b/src/bees-context.cc
@@ -777,10 +777,17 @@ BeesContext::scan_one_extent(const BeesFileRange &bfr, const Extent &e)
                force_insert = true;
        }
 
-       // Don't do nuisance dedupes part 2:  nobody needs more than 100 dedupe/copy ops in one extent
-       if (dedupe_list.size() + copy_list.size() > 100) {
-               BEESLOGDEBUG("Too many dedupe (" << dedupe_list.size() << ") and copy (" << copy_list.size() << ") operations, skipping extent");
-               BEESCOUNT(scan_skip_ops);
+       // Don't do nuisance dedupes part 2:  nobody needs more than 100 dedupe ops in one extent
+       if (dedupe_list.size() > 100) {
+               BEESLOGDEBUG("Too many dedupe (" << dedupe_list.size() << ") operations, skipping extent");
+               BEESCOUNT(scan_skip_dedupe);
+               force_insert = true;
+       }
+
+       // Don't do nuisance dedupes part 3:  zero copies in one extent
+       if (copy_list.size() > 0) {
+               BEESLOGDEBUG("Too many copy (" << copy_list.size() << ") operations, skipping extent");
+               BEESCOUNT(scan_skip_copy);
                force_insert = true;
        }
 

...although 0 is very pessimistic - maybe try 10 or so. You don't want to miss easy copy cases, like when two extents are identical except one is longer.

This doesn’t quite work as intended because it’s decided per reference, not per extent, so it can allow unproductive dedupes (e.g. moving some refs from extent A to B without removing either because a copy is discovered too late). Fixing that is part of the next milestone.

But duperemove isn't able to even properly checksum all of the snapshots - it can reliably only do one snapshot at a time.

duperemove attempts to read and sort the entire data set before it can do any productive work, which is a non-starter with large data sets. Most dedupers that run on btrfs work this way, and that limitation is why I had to write my own.

I've collected enough files in its database now that even reading the file list from it consumes huge amounts of memory. So it looks like despite using SQlite, it reads everything into memory first.

Yeah, SQlite is only meant to cache inserted hashes between runs. SQlite is not used for the actual dedupe--that's an rbtree in memory, last time I checked.

Inserting hashes into a balanced tree data structure will only scale as long as that structure fits in RAM. After that, every hash insert or lookup will force swapping (hashes are random, and trees only work well when the inserts and lookups have high locality). Once memory is full, the program runs a million times slower because what should be a DRAM row access is now a trip out to a random page of NVMe and back.

SQlite for dedupe has all the same problems as an rbtree--except much worse, because of SQlite's extra overhead compared to C pointers and swap, and requirements for ordered writes and metadata writes to storage devices. Swap doesn't care about write order and has no metadata, but an SQlite database will be unreadable without both.

It is possible to fix that in duperemove: read the files, make the hashes, write them to a file, merge-sort the file, then read the file in order and do the dedupes. An rbtree pays the sorting cost on every insert, while a sorted external file pays the sorting cost only once. Merge sorts are all sequential access (they're used on tape drives) so they don't suffer swap thrashing. A merge sort can even compress the data to cut back on memory usage. I made a few dedupers that scale up to several terabytes based on merge sort--before bees came along with even harsher performance requirements.
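
A sketch of that approach, with a hypothetical record layout and file names (this is neither duperemove's nor bees's code, and the actual dedupe step is left as a comment):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <fstream>
    #include <istream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Rec {
        uint64_t hash;    // content hash of one extent
        uint64_t bytenr;  // where that extent lives on disk
        bool operator<(const Rec &o) const { return hash < o.hash; }
    };

    // Phase 1: sort bounded-size chunks in RAM and write each one as a sorted "run".
    std::vector<std::string> write_runs(std::istream &in, size_t max_recs)
    {
        std::vector<std::string> runs;
        std::vector<Rec> buf;
        auto flush = [&] {
            if (buf.empty()) return;
            std::sort(buf.begin(), buf.end());   // sorting cost paid once per chunk
            std::string name = "run." + std::to_string(runs.size());
            std::ofstream(name, std::ios::binary)
                .write(reinterpret_cast<const char *>(buf.data()), buf.size() * sizeof(Rec));
            runs.push_back(name);
            buf.clear();
        };
        Rec r;
        while (in.read(reinterpret_cast<char *>(&r), sizeof r)) {
            buf.push_back(r);
            if (buf.size() == max_recs) flush();  // memory use stays bounded
        }
        flush();
        return runs;
    }

    // Phase 2: k-way merge of the runs, all sequential reads; equal adjacent
    // hashes are dedupe candidates.
    void merge_and_find_duplicates(const std::vector<std::string> &runs)
    {
        struct Cursor { Rec rec; size_t run; };
        auto later = [](const Cursor &a, const Cursor &b) { return b.rec < a.rec; };
        std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
        std::vector<std::ifstream> files;
        for (size_t i = 0; i < runs.size(); ++i) {
            files.emplace_back(runs[i], std::ios::binary);
            Rec r;
            if (files[i].read(reinterpret_cast<char *>(&r), sizeof r))
                heap.push({r, i});
        }
        bool have_prev = false;
        Rec prev{};
        while (!heap.empty()) {
            Cursor c = heap.top();
            heap.pop();
            if (have_prev && prev.hash == c.rec.hash) {
                // Two extents share a hash: dedupe candidate. A real tool would
                // verify the data and call FIDEDUPERANGE here.
                printf("candidate: 0x%llx <-> 0x%llx\n",
                       (unsigned long long)prev.bytenr, (unsigned long long)c.rec.bytenr);
            }
            prev = c.rec;
            have_prev = true;
            Rec next;
            if (files[c.run].read(reinterpret_cast<char *>(&next), sizeof next))
                heap.push({next, c.run});
        }
    }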

Fun fact: bees was started in 2014 as a project to fix duperemove one problem at a time. By the first public bees release in 2016, all the original duperemove code was gone.

Zygo avatar Aug 09 '25 20:08 Zygo

hmmmkay, somehow I've edited your comment instead of posting my own. Yay github UI! 🚱

Zygo avatar Aug 09 '25 20:08 Zygo

hmmmkay, somehow I've edited your comment instead of posting my own. Yay github UI! 🚱

Unedited... It keeps revisions of each change. :-)

kakra avatar Aug 10 '25 01:08 kakra

That isn't quite what happens.

Well, I let fdupes recombine all the files in the snapshots (none of the files that bees was chunking through have been changed between snapshots, at least not their contents, but somehow the extents have diverged). I've now stopped it because it was mostly finding files that were already deduped.

Looking at bees, it made a huge jump forward now over the weekend:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max 905.448G 516451     228  101099 2025-06-02 10:40   9w 2d 2025-10-15 23:41
  32M 295.045G 433820     228   63792 2025-05-20 08:32  15w 3d 2025-11-27 20:17
   8M  81.912G 207979     228  210631 2025-07-10 13:32  17w 2d 2025-12-10 19:34
   2M   55.33G 161930       0     228 2025-04-28 17:36  1y 25w 2027-02-04 08:10
 512K   14.66G 150956       0     228 2025-04-28 17:36  1y 31w 2027-03-23 08:23
 128K   7.605G 000036       0     228 2025-04-28 17:36       -                -
total   1.328T        gen_now  297121                  updated 2025-08-11 10:42

kakra avatar Aug 11 '25 08:08 kakra

Some reasons why point jumps around:

  • Adding data: If total bytes D grows while deduped bytes N stays the same, point = (999999 * N/D) goes down. (Your snapshots of the progress table show D increasing.)
  • Removing data: If you delete not-yet-deduped data, D drops and N/D goes up. If you delete already-deduped data, N drops and N/D goes down. You may not see the effect until the crawler reaches that location, then point tends to skip over it, so point goes up quickly.
  • Allocated vs used: point and total measure allocated space (e.g., a 1 GiB block group counts as 1 GiB whether full or empty). A balance can repack allocation and shift both N and D noticeably. point moves in whichever direction pulls harder.
  • New data after crawl start: Data with transids outside the crawler’s range is excluded for the current pass, so you can see point leap from ~75% to 100%. Next pass races from 0-75% quickly, then slows--hence we avoid calling point a smooth “percent complete.”
  • Queue variance: Extent queue times vary by ~6 orders of magnitude, so future workload is hard to predict.

This all makes point and ETA extremely noisy.


On slow HDD‑metadata filesystems, bees isn’t “stuck” if the crawl data from beescrawl.dat--(min_transid << 64) | objectid--advances at least once per day. The daily cadence is the real pain: long queues mean infrequent checkpoint updates and longer post‑reboot catch‑ups.

The actionable fixes are:

  • Shorter queues and smaller per-extent work to make checkpoint updates more frequent. These reduce performance, in time or space respectively. I'd rather not rely on these approaches, except as quick hacks on v0.11 code, or as configurable options that users can enable.
  • Checkpoint the whole queue instead of just the lowest extent bytenr. This means it wouldn't matter how long the queue gets, because we'd just save the whole thing on checkpoints and load it on reboots. This requires a new on-disk format to replace beescrawl.dat.

In beescrawl.dat, objectid is the next‑extent bytenr from btrfs; it doesn’t depend on total data size. When updates happen, (min_transid << 64) | objectid should increase monotonically. The ProgressTracker class enforces this.
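
That ordering is just a lexicographic compare of (min_transid, objectid). A minimal sketch of the monotonic-advance rule (illustrative names, not the real ProgressTracker):

    // Sketch only: the crawl position compares as the 128-bit value
    // (min_transid << 64) | objectid, i.e. first by min_transid, then by bytenr.
    #include <cstdint>
    #include <tuple>

    struct CrawlPos {
        uint64_t min_transid;  // lower transid edge of the current rectangle
        uint64_t objectid;     // next extent bytenr within that rectangle
    };

    inline bool operator<(const CrawlPos &a, const CrawlPos &b)
    {
        return std::tie(a.min_transid, a.objectid) < std::tie(b.min_transid, b.objectid);
    }

    // Checkpoints may only move forward under this ordering.
    struct MonotonicPos {
        CrawlPos pos{};
        void advance(const CrawlPos &next)
        {
            if (pos < next)
                pos = next;  // attempts to move backwards are ignored
        }
    };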

Algorithms in bees are acyclic and use bounded memory, so progress always moves forward; only the rate changes. True "stuck" states would show up as explosive memory leaks, deadlocks, or a saturated queue that never drains--i.e., obvious failure modes rather than silent stalls.

Zygo avatar Aug 11 '25 21:08 Zygo

Yeah, I think that by deduplicating, I only moved queued work from the current transid range to a future one. But by then, some of the old snapshots may be gone. We will see.

kakra avatar Aug 11 '25 22:08 kakra

FWIW, applying duperemove to the snapshots that bees took an eternity on sped it up considerably:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max    1.05T 896222  101099  317613 2025-08-18 10:48 22m 39s 2025-08-18 14:27
  32M 203.004G 718380     228   63792 2025-05-20 08:32   5w 8h 2025-09-22 23:00
   8M  33.162G 374189     228  210631 2025-07-10 13:32   9w 2d 2025-10-22 20:23
   2M  10.171G 192997       0     228 2025-04-28 17:36  1y 14w 2026-11-29 05:55
 512K    3.04G 168217       0     228 2025-04-28 17:36  1y 26w 2027-02-22 14:54
 128K   2.992G 000059       0     228 2025-04-28 17:36       -                -
total   1.297T        gen_now  317998                  updated 2025-08-18 14:04

This is after a reboot; it caught up fast. Bees is running with the patch you suggested in https://github.com/Zygo/bees/issues/320#issuecomment-3121332249.

duperemove, OTOH, didn't finish. I eventually killed it after some days. But it still caused bees to make faster progress. It now deduplicates files from more recent snapshots, which seems to be faster overall.

kakra avatar Aug 18 '25 12:08 kakra

FWIW, applying duperemove to the snapshots that bees took an eternity on sped it up considerably:

If duperemove had any effect here, it was a minor one. Duperemove only removes physical extents, which means bees won’t need to process them, but the stats show that extent removal accounts for just 3–10% of the dataset (30–110 GiB). That’s not enough to explain a 300-400 GiB jump in bees progress.

The more likely cause is bees itself. If you've modified bees for a shorter queue and lower copy limit, it speeds up both ends of the queue. Once the slowest extents clear out, bees can skip rapidly over the extents between the front and back of the queue. Those reread skips are slower than checkpoint skips, but still much faster than dedupe, because bees just checks the hash table and moves on. That fits with the size of the jump: hundreds of GiB read quickly, but with no new dedupe work.

There’s also an important milestone visible in the latest progress report:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max    1.05T 896222  101099  317613 2025-08-18 10:48 22m 39s 2025-08-18 14:27
total   1.297T        gen_now  317998                  updated 2025-08-18 14:04
-
  max 905.448G 516451     228  101099 2025-06-02 10:40   9w 2d 2025-10-15 23:41
total   1.328T        gen_now  297121                  updated 2025-08-11 10:42

The largest crawler has completed its second pass, and may soon go idle (start and ETA are on the same day). When that happens, it stops competing for bandwidth with the smaller crawlers.

Even without attributing cause yet, we can already see a big change in the ETAs for the mid-size tiers:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  32M 203.004G 718380     228   63792 2025-05-20 08:32   5w 8h 2025-09-22 23:00
   8M  33.162G 374189     228  210631 2025-07-10 13:32   9w 2d 2025-10-22 20:23
total   1.297T        gen_now  317998                  updated 2025-08-18 14:04
-
  32M 295.045G 433820     228   63792 2025-05-20 08:32  15w 3d 2025-11-27 20:17
   8M  81.912G 207979     228  210631 2025-07-10 13:32  17w 2d 2025-12-10 19:34
total   1.328T        gen_now  297121                  updated 2025-08-11 10:42

The 32 M and 8 M tiers collapsed from 15–17 weeks to 5–9 weeks in the span of a week. That’s strong evidence that bees is running faster than it was, but whether it’s from reread skips, queue length changes, or reduced copy limits will take some more observation to confirm.

So duperemove shaved a small slice off the dataset, but the main progress last week may have come from reread skipping over work bees had already done. We may soon see a further acceleration as the largest crawler reaches its idle state.

Zygo avatar Aug 18 '25 18:08 Zygo

Earlier, I removed your suggested patch from bees again and let it run without it. It has now finished:

extsz   datasz  point gen_min gen_max this cycle start tm_left   next cycle ETA
----- -------- ------ ------- ------- ---------------- ------- ----------------
  max 862.714G 000340  343373  343374 2025-08-27 14:22       -             idle
  32M 223.353G 000207  343373  343374 2025-08-27 14:22       -             idle
   8M  96.228G 000105  343373  343374 2025-08-27 14:22       -             idle
   2M  51.003G 000050  343373  343374 2025-08-27 14:22       -             idle
 512K   23.69G 000000  343373  343374 2025-08-27 14:22       -             idle
 128K  39.011G 000000  343373  343374 2025-08-27 14:22       -             idle
total   1.266T        gen_now  343374                  updated 2025-08-27 14:22

For some reason, bees progressed faster when I helped eliminate duplicate extents with duperemove. I always kept the same hash database, but let it scan the snapshots which the bees task log showed long activity on (THREADS in the run file). I then killed duperemove when it started to create too much system load. It looks like bees then stopped working on those extents, and when it encountered them again in a later generation, it skipped them faster.

kakra avatar Aug 27 '25 12:08 kakra