restic icon indicating copy to clipboard operation
restic copied to clipboard

Sanitize tags when used as filenames by restic mount

Open greatroar opened this issue 4 years ago • 7 comments

What does this PR change? What problem does it solve?

Draft PR, untested.

Tags that are not valid filenames (and some that are) are now sanitized by restic mount for display in the tags/ directory.

The rule is explained in the comment:

// Some tags are problematic when used as filenames:
//
//     ""
//     ".", ".."
//     anything containing '/'
//
// We prepend a '%' to mark these as special, then replace '/' by "%_"
// and '%' by "%%". Because this produces ambiguity when the actual tag
// starts with '%', we do the same to any tag name already starting with '%'.

E.g., "foo/bar" becomes "%foo%_bar".

I chose '%' because it should be rare in actual tags and it's not a special character to common shells. The rule is ad hoc, but designed to pass through as many tags unchanged as possible. The resulting filenames are always valid on Linux; to my knowledge, FreeBSD allows the same filenames as Linux; but I didn't really consider MacOS much, which may or may not require valid (normalized?) Unicode filenames.

Was the change previously discussed in an issue or on the forum?

Should fix #3690, but I haven't tested it yet.

Checklist

  • [x] I have read the contribution guidelines.
  • [x] I have enabled maintainer edits.
  • [ ] I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • [ ] There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • [x] I have run gofmt on the code in all commits.
  • [ ] All commit messages are formatted in the same style as the other commits in the repo.
  • [ ] I'm done! This pull request is ready for review.

greatroar avatar Mar 31 '22 05:03 greatroar

Hm, why not just replace / by _ and _ by __? Isn't % a special character in cmd.exe, for environment variable expansion?

I find this easier to read:

  • foo/bar becomes foo_bar
  • foo_bar becomes foo__bar

fd0 avatar Mar 31 '22 06:03 fd0

The reasoning was to use something that nobody puts in their tags, so that as few tags as possible get mangled. Your suggestion is probably more reasonable, though. The remaining issue is then, what happens to "", "." and ".."? Maybe just skip those in the dir listing and issue a log message?

greatroar avatar Mar 31 '22 18:03 greatroar

Sorry if I'm dense, but instead of making hacky workarounds, why don't we simply disallow the offending character(s) in tags entirely?

That will keep things clean, remove the need for hacks, and it's IMO fully reasonable to not allow everything in certain types of data, in particular "identifiers" like tags which aren't really meant to consist of an essay of some language that require the entire Unicode charts to convey its meanings..

rawtaz avatar Mar 31 '22 18:03 rawtaz

@fd0 TagsDir.Lookup also needs to compute the inverse transformation, and your suggestion is not invertible: "foo__bar" could have been "foo_bar" or "foo//bar".

Maybe we need to approach this differently and just map / to _, then accept that tags/foo_bar holds snapshots labeled either foo_bar or foo/bar. It's just a navigational aid, after all.

(The alternative of creating a directory structure representing foo/bar/baz is more complicated and also can't handle the case of leading slash, trailing slash or multiple consecutive slashes.)

greatroar avatar Mar 31 '22 18:03 greatroar

@rawtaz I can imagine people backing up web data and using URLs as tags...

greatroar avatar Mar 31 '22 19:03 greatroar

This will be pretty easy to implement once #2913 is merged.

MichaelEischer avatar Jul 30 '22 19:07 MichaelEischer

I've rebased the PR resolve the conflicts with #2913.

Maybe we need to approach this differently and just map / to _, then accept that tags/foo_bar holds snapshots labeled either foo_bar or foo/bar. It's just a navigational aid, after all.

(The alternative of creating a directory structure representing foo/bar/baz is more complicated and also can't handle the case of leading slash, trailing slash or multiple consecutive slashes.)

The tree structure is what is now used since #2913 is merged. Multiple unexpected slashes will then be removed by the filename normalization, such that we can't really avoid aliasing there.

My preferred solution would be to just map slashes to underscores and ignore aliasing. The second commit does just that.

I though about implementing a fallback strategy to generate unique tag names by appending a counter in case conflicts exist. However, that leads to rather unintelligible tag names such as _ and _-1, which won't help anyone.

MichaelEischer avatar Aug 07 '22 12:08 MichaelEischer

Looks good :)

fd0 avatar Aug 18 '22 18:08 fd0