Periodic deduplication using BEES
Looking at all the features that BTRFS offers, I was impressed to discover how much deduplication can matter. Sadly, there is no easy way for users to configure periodic deduplication.
Looking at the periodic scripts this project already provides, periodic deduplication might be a good fit for it.
Details
I would propose using BEES rather than duperemove, because BEES works at the block level. It will therefore not be as efficient as a file-based deduplication tool, but it is more suitable for general usage.
As with most of the scripts in this project, it would be worthwhile to look for a sensible default that doesn't drain too many resources while running.
I think BEES may not be suitable for a maintenance task (otherwise it's a great tool), or it could be tricky to configure and set up so it can run. The other tasks simply start and wait until they end, while BEES has many tunables. It seems to me it would be more suitable for custom user configuration, because the /etc/sysconfig way is too simple for that. Also, BEES comes with its own .service unit and the beesd wrapper.
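For context, beesd is configured per filesystem UUID under /etc/bees/. Roughly like this (a sketch from memory of the beesd.conf.sample that bees ships; verify the variable names against the sample from your distribution):

```sh
# /etc/bees/<filesystem-uuid>.conf -- sketch, not a verified config
UUID=<btrfs-filesystem-uuid>   # beesd locates the mounted filesystem by UUID
DB_SIZE=$((1024*1024*1024))    # hash table size in bytes; bigger finds more matches, uses more RAM
OPTIONS="--loadavg-target 5"   # one of many tunables: back off when system load is high
```

Even this minimal form shows the problem: the hash table sizing and load tuning are per-filesystem decisions that don't map well onto a single /etc/sysconfig file.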
This could be done but I'm not sure how exactly to define the common use cases and provide reasonable configuration options or presets.
Regarding duperemove, it's similar to the defrag task, but again the number of options can make it hard to configure via the /etc/sysconfig directory. The problem is that each path could require its own set of command-line options, and this becomes cumbersome to track in shell variables.
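To illustrate the "one set of options per path" problem, here is a sketch of how a sysconfig-driven script could look. All the BTRFS_DEDUPE_* variable names are hypothetical (not part of the project today); the duperemove flags (-r, -d, --hashfile) are real options:

```shell
#!/bin/sh
# Sketch: mapping per-path duperemove options stored in /etc/sysconfig-style
# shell variables. Variable names are hypothetical, for illustration only.

BTRFS_DEDUPE_PATHS="/home:/srv/data"
BTRFS_DEDUPE_OPTIONS_HOME="-r -d --hashfile=/var/cache/duperemove/home.db"
BTRFS_DEDUPE_OPTIONS_SRV_DATA="-r -d"

# Derive the options variable name from a path:
# /srv/data -> BTRFS_DEDUPE_OPTIONS_SRV_DATA
path_to_var() {
    printf 'BTRFS_DEDUPE_OPTIONS_%s' "$(printf '%s' "${1#/}" | tr '/a-z' '_A-Z')"
}

old_ifs=$IFS; IFS=:
set -- $BTRFS_DEDUPE_PATHS
IFS=$old_ifs
for p in "$@"; do
    eval "opts=\$$(path_to_var "$p")"
    # Print instead of executing, to show what would run:
    echo "duperemove $opts $p"
done
```

Every new path needs its own variable, which is exactly the bookkeeping that makes plain shell variables awkward here.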
An idea: provide templates for .service and .timer units with various predefined use cases and descriptions. This could be a middle ground between full customizability and a single source of almost-ready-to-use units, where a user picks a suitable one and installs it into systemd. If the units are simply stored in this git repo, there's not much risk of breaking existing systems.
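As a concrete sketch of such a preset (the unit names and the script path are made up for illustration):

```ini
# btrfs-dedupe-home-monthly.service (hypothetical preset)
[Unit]
Description=Monthly deduplication of /home

[Service]
Type=oneshot
ExecStart=/usr/share/btrfsmaintenance/btrfs-dedupe.sh /home

# btrfs-dedupe-home-monthly.timer (hypothetical preset)
[Unit]
Description=Monthly deduplication of /home

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target
```

Each preset would differ only in the target path, the schedule, and a few resource-control lines, so a small set of files could cover the common cases.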
To give you some inspiration, here is how I do it right now:
Example
For myself, I have created my own service in ~/.config/systemd/
dedupe-projects.sh
#!/usr/bin/env bash
set -euo pipefail
# Find duplicate files with fdupes and feed the list to duperemove,
# which submits them to the kernel for deduplication.
fdupes -r /home/kevin/Projects | duperemove --fdupes
fdupes -r /home/kevin/.cache | duperemove --fdupes
./user/dedupe-projects.service
[Unit]
Description=Dedupe Projects
[Service]
Environment="PATH=/usr/lib/ccache/bin:/usr/local/sbin:/usr/local/bin:/usr/bin"
ExecStart=/home/kevin/.config/systemd/dedupe-projects.sh
[Install]
WantedBy=default.target
./user/dedupe-projects.timer
[Unit]
Description=Run Dedupe Projects weekly
[Timer]
OnCalendar=weekly
Persistent=true
[Install]
WantedBy=timers.target
Discussion
- What directories should be deduplicated?
- What is a reasonable schedule?
- How many options are reasonable to configure?
- Performance
- Hash file
I can imagine a few flavours:
- Home directories, once a week
- System directories, once a month
- Home directories, once a month, no hash file
- System directories, once a year, no hash file
- Home directories, once a week, low priority
- System directories, once a month, low priority
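For the "low priority" flavours, systemd already provides the knobs, so those presets would only differ from the others by a few [Service] lines (these directives are standard systemd.exec options):

```ini
[Service]
Nice=19
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
```

That keeps the flavour matrix manageable: schedule in the .timer, path and hash file in the script arguments, priority in the .service.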