Periodic deduplication using BEES
Looking at all the features that BTRFS offers, I was impressed to discover how much deduplication can matter. Sadly, there is no easy way for users to configure periodic deduplication.
Looking at the periodic scripts this project already provides, periodic deduplication might be a good fit for it.
Details
I would propose using BEES rather than duperemove, because BEES works at the block level. It will therefore not be as efficient as a file-based deduplication tool, but it is more suitable for general usage.
As with most of the scripts in this project, it would be worthwhile to look for a sensible default that doesn't drain too many resources while running.
I think BEES may not be suitable for a maintenance task (otherwise it's a great tool), or it could be tricky to configure and set up so it can run. The other tasks simply start and wait until they end, while BEES has many tunables. It seems to me it would be more suitable for custom user configuration, because the /etc/sysconfig way is too simple for that. Also, BEES comes with its own .service unit and the beesd wrapper.
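For context, beesd is configured per filesystem UUID under /etc/bees/. Roughly like this (a sketch from memory of the beesd.conf.sample that bees ships; verify the variable names against the sample from your distribution):

```sh
# /etc/bees/<filesystem-uuid>.conf -- sketch, not a verified config
UUID=<btrfs-filesystem-uuid>   # beesd locates the mounted filesystem by UUID
DB_SIZE=$((1024*1024*1024))    # hash table size in bytes; bigger finds more matches, uses more RAM
OPTIONS="--loadavg-target 5"   # one of many tunables: back off when system load is high
```

Even this minimal form shows the problem: the hash table sizing and load tuning are per-filesystem decisions that don't map well onto a single /etc/sysconfig file.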
This could be done but I'm not sure how exactly to define the common use cases and provide reasonable configuration options or presets.
Regarding duperemove, it's similar to the defrag task, but again the number of options can make it hard to configure via the /etc/sysconfig directory. The problem is that each path could require its own set of command-line options, and this becomes cumbersome to track in shell variables.
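To illustrate the "one set of options per path" problem, here is a sketch of how a sysconfig-driven script could look. All the BTRFS_DEDUPE_* variable names are hypothetical (not part of the project today); the duperemove flags (-r, -d, --hashfile) are real options:

```shell
#!/bin/sh
# Sketch: mapping per-path duperemove options stored in /etc/sysconfig-style
# shell variables. Variable names are hypothetical, for illustration only.

BTRFS_DEDUPE_PATHS="/home:/srv/data"
BTRFS_DEDUPE_OPTIONS_HOME="-r -d --hashfile=/var/cache/duperemove/home.db"
BTRFS_DEDUPE_OPTIONS_SRV_DATA="-r -d"

# Derive the options variable name from a path:
# /srv/data -> BTRFS_DEDUPE_OPTIONS_SRV_DATA
path_to_var() {
    printf 'BTRFS_DEDUPE_OPTIONS_%s' "$(printf '%s' "${1#/}" | tr '/a-z' '_A-Z')"
}

old_ifs=$IFS; IFS=:
set -- $BTRFS_DEDUPE_PATHS
IFS=$old_ifs
for p in "$@"; do
    eval "opts=\$$(path_to_var "$p")"
    # Print instead of executing, to show what would run:
    echo "duperemove $opts $p"
done
```

Every new path needs its own variable, which is exactly the bookkeeping that makes plain shell variables awkward here.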
An idea: provide templates for .service and .timer units with various predefined use cases and descriptions. This could be a middle ground between full customizability and a single source of almost-ready-to-use units, where a user picks a suitable one and installs it into systemd. If the units are simply stored in this git repo, there's not much risk of breaking existing systems.
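As a concrete sketch of such a preset (the unit names and the script path are made up for illustration):

```ini
# btrfs-dedupe-home-monthly.service (hypothetical preset)
[Unit]
Description=Monthly deduplication of /home

[Service]
Type=oneshot
ExecStart=/usr/share/btrfsmaintenance/btrfs-dedupe.sh /home

# btrfs-dedupe-home-monthly.timer (hypothetical preset)
[Unit]
Description=Monthly deduplication of /home

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
```

Each preset would differ only in the target path, the schedule, and a few resource-control lines, so a small set of files could cover the common cases.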
To give you some inspiration, here is how I do it right now:
Example
For myself, I have created my own service in ~/.config/systemd/
dedupe-projects.sh
#!/usr/bin/env bash
set -euo pipefail
# Find duplicate files with fdupes and feed the list to duperemove,
# which submits them to the kernel for deduplication.
fdupes -r /home/kevin/Projects | duperemove --fdupes
fdupes -r /home/kevin/.cache | duperemove --fdupes
./user/dedupe-projects.service
[Unit]
Description=Dedupe Projects
[Service]
Environment="PATH=/usr/lib/ccache/bin:/usr/local/sbin:/usr/local/bin:/usr/bin"
ExecStart=/home/kevin/.config/systemd/dedupe-projects.sh
[Install]
WantedBy=default.target
./user/dedupe-projects.timer
[Unit]
Description=Run Dedupe Projects weekly
[Timer]
OnCalendar=weekly
Persistent=true
[Install]
WantedBy=timers.target
Discussion
- What directories should be deduplicated?
- What is a reasonable schedule?
- How many options are reasonable to configure?
- Performance
- Hash file
I can imagine a few flavours:
- Home directories, once a week
- System directories, once a month
- Home directories, once a month, no hash file
- System directories, once a year, no hash file
- Home directories, once a week, low priority
- System directories, once a month, low priority
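For the "low priority" flavours, systemd already provides the knobs, so those presets would only differ from the others by a few [Service] lines (these directives are standard systemd.exec options):

```ini
[Service]
Nice=19
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
```

That keeps the flavour matrix manageable: schedule in the .timer, path and hash file in the script arguments, priority in the .service.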