backintime icon indicating copy to clipboard operation
backintime copied to clipboard

Can't include specific files from excluded wildcard match (rsync: ordering of --exclude/--include) - Workaround exist

Open insubstudios opened this issue 2 years ago • 8 comments

i'm trying to exclude most of my dotfiles and include a few specifically but it's not working.

i added /home/xin/.* to my excludes and to my includes:

  • 📁 /home/xin/.ssh
  • 📁 /home/xin/.fonts
  • 📄 /home/xin/.bashrc
  • 📄 /home/xin/.gitconfig
  • 📄 /home/xin/.gitignore

the folders are getting backed up but not the single files.

i checked the logs and the rsync command generated by BiT is putting the includes after the excludes. i copy/pasted it, moved the includes up, and ran it manually and it worked as expected.

Original rsync call from BIT logs (causing the problem)

from backintime log (line breaks added for clarity):

[I] rsync --recursive --times --devices --specials --hard-links --human-readable -s --links --perms --executability --group --owner --info=progress2 --no-inc-recursive --delete --delete-excluded -v -i --out-format=BACKINTIME: %i %n%L --link-dest=../../20230405-165753-308/backup --chmod=Du+wx
	--exclude=/media/xin/bigbox-6-A/xin
	--exclude=/home/xin/.local/share/backintime
	--exclude=.local/share/backintime/mnt
	
	--include=/home/xin/
	--include=/home/
	--include=/home/xin/.fonts/
	--include=/home/xin/.ssh/
	
	--exclude=.gvfs
	--exclude=.cache/*
	--exclude=.thumbnails*
	--exclude=.local/share/[Tt]rash*
	--exclude=*.backup*
	--exclude=*~
	--exclude=.dropbox*
	--exclude=/proc/*
	--exclude=/sys/*
	--exclude=/dev/*
	--exclude=/run/*
	--exclude=/etc/mtab
	--exclude=/var/cache/apt/archives/*.deb
	--exclude=lost+found/*
	--exclude=/tmp/*
	--exclude=/var/tmp/*
	--exclude=/var/backups/*
	--exclude=.Private
	--exclude=/home/xin/.*
	
	--include=/home/xin/**
	--include=/home/xin/.fonts/**
	--include=/home/xin/.ssh/**
	--include=/home/xin/.gitignore
	--include=/home/xin/.gitconfig
	--include=/home/xin/.bashrc
	
	--exclude=*
/ /media/xin/bigbox-6-A/xin/backintime/offish/xin/1/new_snapshot/backup

Expected rsync call modified order of include/exclude (solving the problem)

modified that works as expected:

rsync --recursive --times --devices --specials --hard-links --human-readable -s --links --perms --executability --group --owner --info=progress2 --no-inc-recursive --delete --delete-excluded -v -i --out-format=BACKINTIME: %i %n%L --link-dest=../../20230405-165753-308/backup --chmod=Du+wx
	--exclude=/media/xin/bigbox-6-A/xin
	--exclude=/home/xin/.local/share/backintime
	--exclude=.local/share/backintime/mnt
	
	--include=/home/xin/
	--include=/home/
	--include=/home/xin/.fonts/
	--include=/home/xin/.ssh/
	
	--include=/home/xin/.gitignore
	--include=/home/xin/.gitconfig
	--include=/home/xin/.bashrc
	
	--exclude=.gvfs
	--exclude=.cache/*
	--exclude=.thumbnails*
	--exclude=.local/share/[Tt]rash*
	--exclude=*.backup*
	--exclude=*~
	--exclude=.dropbox*
	--exclude=/proc/*
	--exclude=/sys/*
	--exclude=/dev/*
	--exclude=/run/*
	--exclude=/etc/mtab
	--exclude=/var/cache/apt/archives/*.deb
	--exclude=lost+found/*
	--exclude=/tmp/*
	--exclude=/var/tmp/*
	--exclude=/var/backups/*
	--exclude=.Private
	--exclude=/home/xin/.*
	
	--include=/home/xin/**
	--include=/home/xin/.fonts/**
	--include=/home/xin/.ssh/**
	
	--exclude=*
/ /media/xin/bigbox-6-A/xin/backintime/offish/xin/1/new_snapshot/backup

insubstudios avatar Apr 06 '23 16:04 insubstudios

Please have a look at #1419

There is no concrete solution for your problems but further resources to read. The pattern matching rules of rsync are not easy.

EDIT: Took some time to understand your problem. We have to take a closer look into the related rsync rules. I assume that there is a solution for your case without modifying BITs behavior in ordering include/excludes.

EDIT2: This pattern matching with rsync always gives me a headache.

buhtz avatar Apr 06 '23 20:04 buhtz

Hi, thanks for the response :-)

i found this article helpful too: https://zingbretsen.com/blog/rsync-include-exclude that's how i figured out how to edit the command and have it work when running manually.

i went down a bit of a rabbit hole last night, ran a bunch of rsync commands, and i'm pretty sure i have it figured out. sorry this a bit of an info dump.

i believe it is a simple internal fix and is half the solution to #561

Expected Behavior:

  • included files should always be included in backup
  • included folders should always be included in backup
  • children of included folders should be included unless excluded
  • excluded files should be excluded unless explicitly included
  • excluded folders should be excluded unless explicitly included
  • children of excluded folders should be excluded unless explicitly included

rsync: include & exclude

  • once something is included ~~it is in unless excluded~~ it cannot be excluded. (edit)
    • it's children will be included unless excluded, hence the universal exclude at the end. (edit)
  • once something is excluded it cannot be included.
  • for rsync to include something, it needs to "see" it. so all of the parent directories need to be included as well.

The order of two or more include clauses doesn't seem to matter. Similarly for two or more exclude clauses. But the relative order of include and exclude is really important.

backintime

  • includes are input and stored as files or folders and are handled differently by the rsyncInclude() method in snapshots.py
    • rsyncInclude() outputs two sets for the first and second include chunks (see below)
  • excludes are input by pattern, file, or folder. all are stored as strings and have no special handling by the rsyncExclude() method.

The rsyncSuffix() method builds the include/exclude part of the rsync command run for snapshots. This is grouped into 5 alternating chunks:

  1. exclude backup location and BiT config files
  2. include1: currently: folders and parent folders (should include files)
  3. exclude the exclude list
  4. include2: currently: files and folder children, /path/to/folder/** (should only be folder children)
  5. universal exclude, --exclude=*

Because BiT isn't adding the included files until after the exclude rules, they cannot be added. They need to be added to the first include chunk.

This can easily be done by changing line 2012 from:

                items2.add('--include={}'.format(folder))

to:

                items1.add('--include={}'.format(folder))

insubstudios avatar Apr 07 '23 18:04 insubstudios

Thanks a lot for analyzing and breaking this down. This will help a lot.

I'm not well into rsync behavior. And I'm always scared that modifying this will influence BITs default behavior and break someones other backups.

Because of that I'm very pedantic about a fix. I would like to dive deep into that rabit hole and understand all the details of how rsync behavior. That is how I can be "sure" about no one else is harmed by such a fix. It will take time I don't have currently. There are three other maintainers in the team but their time is also limited.

But I add the high priority label and the next-release-milestone to that Issue.

buhtz avatar Apr 07 '23 19:04 buhtz

Thank you. Understandable. There's never enough time.

insubstudios avatar Apr 07 '23 21:04 insubstudios

Workaround

i found a workaround for now:

add the include parameters into "additional options" under the "Expert Options" tab in settings.

this is inserted in the rsyncPrefix (defined in common/tools.py) so it is inserted before any other inlcude or exclude arguments.

make sure it doesn't match your backup location, ~/.local/share/backintime, or .local/share/backintime/mnt as that will probably cause problems. those are probably excluded first for a reason.

rsyncPrefix is used by snapshots.py for restore, backupConfig, and takeSnapshot and by sshtools.py for checkRemoteCommands. there doesn't seem to be any problems for restore and backupConfig. haven't tested the ssh stuff because i don't have that as part of my set up.

rsyncSuffix is only used by takeSnapshot.

insubstudios avatar Apr 13 '23 18:04 insubstudios

I have recently spent a little time on this subject of resolving the logic of the Include and Exclude configuration. I have what I think is a correct solution with a more easily followed logic, and I'm writing some description and explanation before I push it to a branch of my public repo for consideration. It solves this issue #1420 and #561 and meets all of the expectations identified by @insubstudios above.

I very much agree with the concern about quietly sabotaging someone's backup by subtly changing the logic of how their configuration is applied. As one thing to help with that, the branch (not yet pushed) includes unit tests where we can easily add more test cases of whatever combinations of includes, excludes, and subject files and try them out with both the current and proposed strategy.

DerekVeit avatar Aug 16 '24 20:08 DerekVeit

I've pushed the branch that I mentioned above: https://github.com/DerekVeit/backintime/tree/path_selections

With the exception of not yet trying to handle EncFS, it is working (I believe), but I don't consider it PR-ready. Given the significance of the change, it needs to be looked at more critically than usual, especially since I'm a new contributor. Some parts of it are maybe not for inclusion, e.g. a somewhat ad hoc log function for the testing code. So I'm just hoping for some review and feedback.

The main change is at Snapshots.rsyncSuffix to change how the --include= and --exclude= options get added to the rsync call. The previous/existing code adds them in a certain order that only works most of the time. The new strategy sorts the paths of the included and excluded paths in such a way that more-specific path specifications always have priorty.

By following a strategy that more-specific specifications should prevail, I believe this should not only fix #561 and #1420 but also be more reliably correct in general.

There are integration-level tests (test_rsync_selections.py) that check various cases, including for #561 and #1420. They call the Snapshots.backup method. For now, the revised rsyncSuffix still uses the original strategy unless a SELECTIONS_MODE attribute of the Config object exists and is set to "sorted". The tests use this to test cases with both strategies, original and sorted.

To make it easy to test various cases of existing file structures and include/exclude configurations, there is a test/filetree.py module for describing file structures with a string, e.g.:

"""
    var/
        log/
            dmesg
            syslog
"""

And the test cases are specified in a text format that uses those strings, e.g. in test/selection_cases. This is to make it easy to add more cases and also to make it very clear what each case is testing.

:exclude-name
    includes
        /var
    excludes
        log
    files_tree
        var/
            lib/
                foo
                log
            log/
                dmesg
                syslog
    expected_tree
        var/
            lib/
                foo

The tests use pytest for parametrize and fixtures, and there are fixtures defined in test/conftest.py for the Config object and the Snapshots object.

For some verbose logging from the test functions there is a simple log function in test/logging.py.

The new code does not yet include handling the EncFS feature. This might be as simple as adding encode = self.config.ENCODE and applying that in the right places as in rsyncInclude and rsyncExclude, but I haven't looked at that closely enough to see what I need for testing it, so I have left it alone so far. In this respect, the new code is definitely not ready. (I know the consideration of removing EncFS is only a maybe-sometime idea for now.)

I'm interested in BiT maintainers' thoughts, questions, and opinions.

SELECTION_MODE is probably only for temporary development testing purposes and can be eliminated, and rsyncSuffix might be simplified. The note I put in the docstring might be addressed or removed.

Although pytest is the default test runner for Back In Time, I know the existing tests are based on unittest. I like pytest, but consistency might be better. I could revise these tests to use generic.TestCase instead. I would then want to add the ddt library to provide the parameterizing.

The pathSelections method is more sprawling than I would like.

The log function is not essential to anything. I find it useful to have easy, verbose logging from my tests, especially while working on them, but it can be removed.

DerekVeit avatar Oct 04 '24 00:10 DerekVeit

Hello Derek, thank you very much for your investment and contribution. Don't be shy. 😄 I would suggest that you open a PR, add your explanation in its comment field and mark it as "Draft". As a PR it is easier for us to organize. As an Issue it would drown between all the other issues.

buhtz avatar Oct 04 '24 06:10 buhtz