add stat-based include/exclude
there are tickets about doing file size based exclusion: #902, jborg/attic#330
file size is a stat result attribute, so this is a special case of a stat-based rule.
also in stat result:
- timestamps: atime, ctime, mtime
- type and mode
- uid / gid
so we could add a mechanism to define inclusion / exclusion rules not only based on the file's path/name (as we already have), but also based on comparing stat attributes to given values.
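A minimal sketch of such a stat-based rule, as a predicate over `os.stat()` results (the helper name and keyword arguments are made up for illustration, this is not borg code):

```python
import os

def stat_matches(path, max_size=None, min_mtime=None, uid=None):
    """Return True if the file passes all given stat-based filters.

    Hypothetical helper: each keyword mirrors one stat attribute
    (st_size, st_mtime, st_uid); None means "don't filter on this".
    """
    st = os.stat(path, follow_symlinks=False)
    if max_size is not None and st.st_size > max_size:
        return False
    if min_mtime is not None and st.st_mtime < min_mtime:
        return False
    if uid is not None and st.st_uid != uid:
        return False
    return True
```

With `max_size=100 * 1024**2` this would express the "exclude files over 100M" use case from the tickets above.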
If somebody finds this while searching for a solution/workaround for borg 1.0/1.1:
- you can exclude "known-big" files by a name-based pattern, like `*.iso` (or their directory, like `.../Virtualbox VMs/*`)
- you can use the unix `find` tool to create a list for borg's `--exclude-from` option

Temporarily excluding big files is especially useful for initial backup(s), which might take a while.
Note: the first implementation could just limit the scope to size-based include/exclude (but when writing the code, do it in a way that e.g. timestamp-based can be easily done also).
> you can use the unix `find` tool to create a list for borg's `--exclude-from` option
Beware of race conditions, though (i.e. large files appearing after you've generated the list).
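For reference, generating such an exclude list could also be scripted directly in Python instead of via `find` (a sketch; the function name and the walk-based approach are just illustrative):

```python
import os

def write_exclude_list(root, threshold, outfile):
    """Walk `root` and write the paths of all files larger than
    `threshold` bytes, one per line, usable with --exclude-from."""
    with open(outfile, "w") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getsize(path) > threshold:
                        out.write(path + "\n")
                except OSError:
                    pass  # file vanished between listing and stat
```

The same race condition applies: files created or grown after the list was written won't be excluded.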
I have a proposal for how this could be implemented. Rather than a global CLI flag like exclude-by-size, it could be added as a special borg-patterns prefix that is applied at the individual pattern level. This would make it quite flexible -- you could apply the rule to certain files/directories only, use it with include patterns, etc.
There's already logic in place to handle prefixes (for R and P) so adding another one should be simple and backwards compatible. I propose calling it F for "filter". The prefix would be followed by a filter-type specifier, any arguments needed for the filter, and finally the pattern to apply the filter to.
So, to exclude files over 100M from Downloads folders, you would write:
`F size > 100M -/Users/*/Downloads`
To exclude files over 1G everywhere, you could add this to the command line:
borg ... --pattern='F size > 1G -**'
For other stat filters, just replace size with mtime, mode, etc.
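To make the proposal concrete, here is a rough sketch of how such an `F` prefix could be parsed (illustrative only; none of this is existing borg code, and the size-suffix handling is an assumption):

```python
import operator

# comparison operators allowed in a filter expression
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}

# size suffixes, assumed here to be powers of 1024
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(text):
    """Parse '100M' -> 104857600; bare numbers are bytes."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def parse_filter_pattern(line):
    """Split 'F size > 100M -/Users/*/Downloads' into
    (attribute, compare-function, value, remaining pattern)."""
    prefix, attr, op, value, pattern = line.split(None, 4)
    assert prefix == "F"
    if attr == "size":
        value = parse_size(value)
    return attr, OPS[op], value, pattern
```

Other filter types (`mtime`, `mode`, ...) would just need their own value parsers.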
I have more thoughts, including how to combine multiple filters together, but wanted to put this out there first. What do you think of the proposal? (If it's well received I may take a stab at implementing it.)
In the end, I guess this will need boolean expressions, with the operators `and`, `or`, and `not`.
And the terms in these expressions would be stuff like:
- `size < 100M`
- `mtime < 1d`
- `user == joe`
See `man find` for what people might want to select (not sure all of it makes sense for backups) and how `find` expresses it.
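One way to model such boolean expressions internally is as composable predicates over a stat result (a sketch of the idea, not a syntax proposal; the helper names are made up):

```python
import operator

def term(attr, op, value):
    """One term like `size < 100M`: a predicate over an os.stat_result,
    comparing the st_<attr> field to `value` with `op`."""
    return lambda st: op(getattr(st, "st_" + attr), value)

def all_of(*preds):
    """Boolean `and` over predicates."""
    return lambda st: all(p(st) for p in preds)

def any_of(*preds):
    """Boolean `or` over predicates."""
    return lambda st: any(p(st) for p in preds)

def negate(pred):
    """Boolean `not`."""
    return lambda st: not pred(st)
```

A rule like `size < 100M and uid == 1000` would then be `all_of(term("size", operator.lt, 100 * 1024**2), term("uid", operator.eq, 1000))`; a `user == joe` term would additionally need a uid lookup (e.g. via `pwd.getpwnam`).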
As a borg backup archive is usually expected to be a full archive containing all the files in the input data set, I guess the first step is to look at which filters actually make sense.
One obvious case is being in a hurry and wanting to make a quick first backup, ignoring huge files (like having important little documents next to less important `*.iso` files).
Other use cases?
I think for an initial version of this, we could keep it really simple and not worry about boolean operators. I imagine use cases for them would be relatively rare. By the nature of how patterns combine, `or` can already be achieved by just writing two separate rules.
And multiple include rules (`+`) with negated conditions can be used to approximate `and`. For example, to exclude `user == joe && size > 1M`, you could write:
```
F user != joe + **
F size <= 1M + **
- **
```
(It's not quite the same as an actual and operator when other rules are involved, since borg stops processing rules once a single match is made, but it's probably Good Enough for now.)
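To illustrate the caveat, here is a toy first-match-wins matcher in the spirit of borg's rule processing (heavily simplified: real borg matches shell-style path patterns, while this only checks the conditions):

```python
def first_match(rules, file_info):
    """Process (predicate, include) rules in order; like borg, stop at
    the first rule whose predicate matches and return its include flag.
    Unmatched files default to included (True)."""
    for predicate, include in rules:
        if predicate(file_info):
            return include
    return True

# Rules mimicking the pattern file above: exclude files that are both
# joe's AND over 1M, include everything else.
rules = [
    (lambda f: f["user"] != "joe", True),    # F user != joe + **
    (lambda f: f["size"] <= 1024**2, True),  # F size <= 1M + **
    (lambda f: True, False),                 # - **
]
```

Only a file that falls through both include rules (joe's, and over 1M) reaches the final exclude rule.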
Other use cases?
The main ones that come to mind are:
- Excluding really large files in general. Protect against accidentally adding a multi-GB VM image, for example, when you know the files you actually care about backing up will be much smaller.
- My downloads directory tends to accumulate a few random things that would be nice to back up, but I want to exclude large files.
Another use case that occurred to me: filtering output from borg list. You may want to check a particular archive (or iterate over all archives) and find files matching certain criteria. Examples:
- Looking for files modified on a particular day
- Searching all past archives for files over a certain size, to see what's taking up space in the repo
Although I guess this use case can already be accomplished by using `borg mount` and `find`.
Yeah. Also this is a bit different to implement (one has to look at archived metadata vs. at stat() metadata from fs).
it is now (master branch, later borg 1.2) possible to feed `find` output (paths) into borg instead of using borg's builtin recursion.
so you can do all matching/selecting that is possible via `find`.
> it is now (master branch, later borg 1.2) possible to feed `find` output (paths) into borg instead of using borg's builtin recursion. so you can do all matching/selecting that is possible via `find`.
Could you please give more details on this or some link to this function? I was searching changelog for "find" keyword without success.
He's referring to the unix find command.
Of course, I know that. But how can I use it to filter files and directories to back up? The only solution I can think of is to put the output of my specific `find` command into a file, prefix each line with a specific pattern selector (see `borg help patterns`), preferably `pf:`, and load that file as a pattern file using the `--patterns-from` or `--exclude-from` arguments.
Will there be a more elegant solution?
`borg create --paths-from-stdin` or `borg create --paths-from-command`
See there: https://borgbackup.readthedocs.io/en/1.2.0b3/usage/create.html
related: #4972
#8895 changed borg a bit: it now reads the simple stat attrs as well as xattrs and ACLs early, before processing file content.
It now has some hardcoded handling for the standard Linux and macOS "no backup" xattrs, and the NODUMP bsdflag is also handled there.
Instead of hardcoding it, there could be either a CLI interface or some other sort of include/exclude "rule".