TriBITS icon indicating copy to clipboard operation
TriBITS copied to clipboard

snapshot-dir.py: Assert that it will not run in a non-git repo or ignored dir

Open bartlettroscoe opened this issue 8 years ago • 5 comments
trafficstars

The current snapshot-dir.py script can delete all of the files in a badly chosen destination directory. This is because the rsync -cav --delete command will delete everything in targeted dest-dir directory that is not the same as the orig-dir. That is like running rm -r * if you are not careful. This problem was mentioned by @gsjaardema who uses this script to snapshot some directories between git repos (thanks Greg).

The help avoid problems like this, a few things would help:

  1. By default, assert that dest-dir is a non-ignored subdir of a git repo.

  2. Provide a --no-op option that will just print what commands would get run without actually doing anything.

  3. Clearly print what orig-dir and dest-dir will be interpreted as by rsync. This can be a little confusing if you are not used to rsync.

  4. Always add the / and the end of orig-dir and dest-dir so that it treats directories correctly.

bartlettroscoe avatar Dec 02 '16 02:12 bartlettroscoe

The first thing I tried was to snapshot to a directory that is not under a git repo. For this I did:

$ cd /tmp/
$ mkdir rabarlt-play
$ cd rabartl-play/
$ mkdir tribits/
$ cd tribits/
$ echo "file I want to keep" > file_i_want_to_keep.txt
$TRIBITS_DIR/python_utils/snapshot-dir.py --orig-dir=$TRIBITS_DIR/ --dest-dir=./

This actually erred out before running rsync with the error:

B) Assert that dest-dir is 100% clean with all changes committed

Not a git repository
To compare two paths outside a working tree:
usage: git diff [--no-index] <path> <path>
Traceback (most recent call last):
  File "/ascldap/users/rabartl/Trilinos.base/Trilinos/TriBITS/tribits/python_utils/snapshot-dir.py", line 73, in <module>
    success = SnapshotDir.snapshotDirMainDriver(sys.argv[1:], defaultOptions)
  File "/home/rabartl/Trilinos.base/Trilinos/TriBITS/tribits/python_utils/SnapshotDir.py", line 299, in snapshotDirMainDriver
    snapshotDir(options)
  File "/home/rabartl/Trilinos.base/Trilinos/TriBITS/tribits/python_utils/SnapshotDir.py", line 328, in snapshotDir
    "Location changes in the destination directory would be overritten and lost!")
  File "/home/rabartl/Trilinos.base/Trilinos/TriBITS/tribits/python_utils/SnapshotDir.py", line 429, in assertCleanGitDir
    workingDir = dirPath )
  File "/home/rabartl/Trilinos.base/Trilinos/TriBITS/tribits/python_utils/GeneralScriptSupport.py", line 497, in getCmndOutput
    raise RuntimeError('%s failed w/ exit code %d:\n\n%s' % (cmnd, errCode, data))
RuntimeError: git diff --name-status HEAD -- . failed w/ exit code 129:

So that use case is already caught by the assertion of the dest-dir and therefore is safely caught before any deletion occurs.

However, if you run this from an ignored subdir of a git repo, then it will delete everything. For example:

$ cd /tmp/rabartl-play/
$ git init .
$ echo "VCed file" > vced_file.txt
$ echo /tribits/ > .gitignore
$ git add .
$ git commit -m "first commit"
$ cd tribits/
$ ls file_i_want_to_keep.txt
file_i_want_to_keep.txt
$ $TRIBITS_DIR/python_utils/snapshot-dir.py --orig-dir=$TRIBITS_DIR/ --dest-dir=./

Now that actually did the rsync and then bombed out when it tried to do the commit. And it deleted my file:

$ ls file_i_want_to_keep.txt
[empty]

This was the use case that I tried yesterday and it did this evil thing. Here it deleted a file that was not under version control without warning.

@gsjaardema, what use case did you run where the snapshot-dir.py scirpt deleted a bunch of files that were not under version control? Was it this use case where you tried to snapshot into an ignored subdir of a git repo?

In any case, I can fix this problem by adding a call to git check-ignore to see if the dest-dir is an ignored subdir or not.

@gsjaardema,, is that enough to make this script safe for your use case? Will this use case catch "blown away parts of my installations many times due to forgetting a trailing / or specifying an incorrect argument"?

bartlettroscoe avatar Dec 02 '16 03:12 bartlettroscoe

Thinking about this more, I realized that in order to make the snapshot-dir.py script 100% safe for all arbitrary use cases by default that it must assert:

  1. dest-dir is a non-ignored subdir of a git repo.

  2. There are no ignored files in dest-dir or any subdir.

  3. There are no untracked files in dest-dir or any subdir.

If all of the above are true, then when rsync runs with the --delete option, it will only be deleting tracked and committed files (and it will not remove any protected files like .git/ files) so it can be undone by just doing git reset --hard HEAD^ to get rid of the snapshot commit.

In addition, to be 100% sure that every file that is snapshotted in from the orig-dir is traceable, one must assert:

  1. orig-dir is a non-ignored subdir of a git repo.

  2. There are no ignored files in orig-dir or any subdir.

  3. There are no untracked files in orig-dir or any subdir.

  4. The branch of the orig-dir is a tracking branch and there are no local commits for the orig-dir in the orig git repo that are not also in the remote tracking branch. (Or, the repo could be in a detached HEAD state and the commit must be in the remote repo.)

Those above checks ensure that the files in the orig-dir are under version control and that version is in a remote tracking branch. That ensures full tractability for every file copied and committed to dest-dir.

We can add options to skip any of the above checks in case that is what the user wants. But, in general, the above checks give the safest possible default behavior.

But note that the snapshot_tribits.py script would need to turn off the check of no ignored files since there are always ignored *.pyc files that get created by Python sitting in the tribits/ dir when it gets snapshotted. But since the tribits/ dir has .gitignore files that get copied over on the rsync, these will get ignored on the commit in the dest-dir.

I can also think of use cases where one would also want to allow untracked files to be present in orig-dir and dest-dir but they would have to request the removal of those checks.

bartlettroscoe avatar Dec 02 '16 04:12 bartlettroscoe

Here is another mistake that someone made recently that I never would have thought of. They tried to snapshot to and form the same directory! Apparently rsync does something pretty evil in that case!

Therefore, we need to put in a check to make sure that the orig-dir and dest-dir don't overlap!

bartlettroscoe avatar Dec 02 '16 21:12 bartlettroscoe

Save some todos for when I get back to this:

  • Change the name of snapshot-dir.py to snapshot_dir_to_git.py, copy in the contents of SnapShotDir.py and update automated tests (this will set up to be able to make script stand-alone and independent). ...

  • Clearly print orig-dir and dest-dir and assert they don't overlap ...

  • Automatically add '/' to the end of orig-dir and dist-dir if they are not already added (people get confused by this).

  • Add print out of what rsync command would be run when setting --skip-rsync.

  • ???

bartlettroscoe avatar Mar 07 '17 16:03 bartlettroscoe

It would be difficult to robustly determine if orig-dir or dest-dir was running in an ignored subdir of a git repo. It is easy to determine these are git repos because git commands are run in each to make sure they are clean and those commands would fail if you try to run them in a non-git repo.

Now that I have added the option --no-op in PR #450 and updated the documentation show here to provide several warnings about this including:

WARNING: Because this tool uses 'rsync --delete', it can be quite destructive
to the contents of dest-dir if one is not careful and selects the wrong
orig-dir or wrong dest-dir!  Therefore, please carefully read this entire help
message and especially the warnings given below and always start by running
with --no-op!

near the top of the --help output and longer warning messages below that like:

* Make sure that dest-dir is a non-ignored subdir of the destination git repo
  and does not contain any ignored subdirs or files that you don't care if
  they are deleted.  (Otherwise, the 'rsync --delete' command will delete any
  files in dest-dir that are not also in orig-dir and since these non-ignored
  files in dest-dir are not under version control, they will not be
  recoverable.)

This tool snapshot-dir.py is a sharp tool and if it is run on the wrong subdirs then it will be bad.

Note that is would be possible to use git ls-files . in orig-dir to get the exact list of files under version control under orig-dir to copy over. Then one can manually implement the copy of those files and the delete of the files that don't match in dest-dir. But that does not solve the problem of wiping out what is in dest-dir that you might want to keep.

bartlettroscoe avatar Feb 15 '22 21:02 bartlettroscoe