restic
When reading stdin restic detects changes in the same data
Output of restic version
restic 0.9.6 compiled with go1.13.4 on linux/amd64
How did you run restic exactly?
docker exec percona-mysql-5.7 mysqldump --all-databases --skip-dump-date | restic --repo /tmp/restic-repo/ --password-file /tmp/pass backup --stdin
Files: 0 new, 1 changed, 0 unmodified
Dirs: 0 new, 0 changed, 0 unmodified
Added to the repo: 301 B
processed 1 files, 828.729 KiB in 0:00
snapshot fa88e084 saved
What backend/server/service did you use to store the repository?
Locally on Ubuntu 18.04 with ext4 filesystem.
Expected behavior
If there were no changes in databases (dumps are identical), restic should report that there are no changes (or 1 file is unmodified).
Actual behavior
If the command is run several times, restic reports that 1 file was changed even though the dumps are identical.
Steps to reproduce the behavior
- Deploy MySQL on Docker.
- Dump a database and pass the output to restic with the --stdin option.
Do you have any idea what may have caused this?
As far as I understand, restic checks timestamps (atime, mtime, ctime) for a file, and this mechanism probably causes the issue. But it is not clear how it works for stdin.
Do you have an idea how to solve the issue?
Probably a flag to ignore timestamps is needed.
@andrey1vanov Thanks for filling out the issue template!
As you suspect, the matter here is that restic checks ctime to detect file changes. If the ctime changed it will consider the file changed, and process it.
In your case, restic will process the file, but it will not store an additional copy of it in the repository, because it has already stored a copy of this data in previous backup runs. This is due to deduplication - it will only save one copy of the same data.
So, effectively this just takes a little more processing time. Is that a concern for you? I don't think restic will be changed so that the reporting of which files changed depends on the result of the deduplication rather than the metadata of the files.
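To make the deduplication point above concrete, here is a minimal sketch of a content-addressed store. It is not restic's code: the `store` type and its `save` method are hypothetical, and it hashes the whole input at once, whereas restic splits data into chunks first. It only shows why sending the same bytes a second time adds no new data.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// store is a toy content-addressed store (hypothetical, for illustration):
// each blob is keyed by the SHA-256 hash of its content.
type store struct {
	blobs map[[32]byte][]byte
}

func newStore() *store {
	return &store{blobs: make(map[[32]byte][]byte)}
}

// save stores data unless a blob with identical content already exists.
// It returns the number of bytes actually added to the store.
func (s *store) save(data []byte) int {
	id := sha256.Sum256(data)
	if _, ok := s.blobs[id]; ok {
		return 0 // identical content is already stored, nothing is written
	}
	s.blobs[id] = append([]byte(nil), data...) // keep a private copy
	return len(data)
}

func main() {
	s := newStore()
	dump := []byte("CREATE TABLE t (id INT);\n")

	// The input still has to be read and hashed on every run,
	// but only the first run adds data.
	fmt.Println("first run added bytes:", s.save(dump))
	fmt.Println("second run added bytes:", s.save(dump))
}
```

The second call still has to read and hash the input, which corresponds to the "little more processing time" mentioned above.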
Here's what I think is the relevant code (from cmd/restic/cmd_backup.go):
var targetFS fs.FS = fs.Local{}
if opts.Stdin {
    if !gopts.JSON {
        p.V("read data from stdin")
    }
    filename := path.Join("/", opts.StdinFilename)
    targetFS = &fs.Reader{
        ModTime:    timeStamp,
        Name:       filename,
        Mode:       0644,
        ReadCloser: os.Stdin,
    }
    targets = []string{filename}
}
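A simplified sketch of why this makes the stdin "file" look changed on every run: in the snippet above, the virtual file gets the timestamp of the backup run as its ModTime, so any check that compares metadata will see a difference even when the bytes are identical. The `fileMeta` type, the helper functions, and the "/stdin" name below are assumptions for illustration, not restic's actual change-detection code.

```go
package main

import (
	"fmt"
	"time"
)

// fileMeta holds the metadata a timestamp-based check looks at
// (hypothetical type, not restic's node structure).
type fileMeta struct {
	Name    string
	Size    int64
	ModTime time.Time
}

// stdinMeta mimics the snippet above: the virtual file receives the
// time of the backup run as its modification time.
func stdinMeta(name string, size int64, runTime time.Time) fileMeta {
	return fileMeta{Name: name, Size: size, ModTime: runTime}
}

// looksChanged compares metadata only; the content is never inspected.
func looksChanged(prev, cur fileMeta) bool {
	return prev.Size != cur.Size || !prev.ModTime.Equal(cur.ModTime)
}

// twoRuns returns the metadata of two backup runs of the same data,
// taken one hour apart.
func twoRuns() (fileMeta, fileMeta) {
	first := time.Date(2020, 1, 1, 10, 0, 0, 0, time.UTC)
	second := first.Add(time.Hour)
	return stdinMeta("/stdin", 848618, first), stdinMeta("/stdin", 848618, second)
}

func main() {
	a, b := twoRuns()
	// Same name and size, different run time: reported as changed.
	fmt.Println("looks changed:", looksChanged(a, b))
}
```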
@rawtaz
So, effectively this just takes a little more processing time. Is that a concern for you?
My concern is that even though we send identical MySQL dumps (the same data) to stdin, restic reports that something in that data changed. This sets false expectations.
My proposal for the stdin case is to ignore ctime, or to allow ignoring it explicitly, for example with an --ignore-ctime option.
@andrey1vanov But how is this an actual problem for you?
The normal goal of backing up data is to get the data backed up, and that is what you still get here. The new/changed files reporting is just statistics for you, so I fail to see how this is relevant enough to be a problem.
Surely you have other means to know if your database contents changed for the day than looking at the report from a backup run?
The flag you propose would effectively just modify how statistics are reported; it has no practical effect on the actual backup or deduplication process in your use case. It's just cosmetics.
I don't think it's ok to accept statistics being inaccurate, and disagree that they are just "cosmetic". I use the statistics to get a feel for whether or not things are working as expected.
My latest example was one of my mounts dropped without me knowing, restic showed total_files_processed = 0, which quickly led to a fix.
@PrplHaz4 If you think it's more than a cosmetic problem, perhaps you or the OP can explain what the actual practical problem is?
I was saying statistics are not merely cosmetic.
I'm not intimately familiar with this problem, but it sounds like any time stdin is used, the statistics will make it look like files have changed when they haven't; the only thing that differs is ctime.
Maybe the "fix" for this is accepting that fact and putting a note in the doc.
The feature request would maybe be something like you said about using dedupe results in statistics if ctime is ignored or by default when input is stdin (as there's no better way to determine if there have been changes to the data)?
Printing file stats doesn't really make sense when using --stdin. Maybe restic shouldn't print them at all in that case?
So instead of:
repository af7ba211 opened successfully, password is correct
Files: 0 new, 1 changed, 0 unmodified
Dirs: 0 new, 0 changed, 0 unmodified
Added to the repo: 305 B
processed 1 files, 2.897 KiB in 0:00
snapshot 946f084a saved
Maybe it should say:
repository af7ba211 opened successfully, password is correct
Added to the repo: 305 B
processed stdin, 2.897 KiB in 0:00
snapshot 946f084a saved
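As a rough sketch of what that could look like in code: the `summary` type and `render` function below are hypothetical and only mirror the two outputs shown above. This is not how restic's progress printer is actually structured.

```go
package main

import (
	"fmt"
	"strings"
)

// summary is a hypothetical stand-in for the counters shown after a backup.
// Sizes and durations are pre-formatted strings to keep the sketch short.
type summary struct {
	FilesNew, FilesChanged, FilesUnmodified int
	DirsNew, DirsChanged, DirsUnmodified    int
	FileCount                               int
	Added, Processed, Duration, Snapshot    string
}

// render builds the summary text. With fromStdin set, the Files/Dirs lines
// are omitted and the "processed" line names stdin instead of a file count.
func render(s summary, fromStdin bool) string {
	var b strings.Builder
	if !fromStdin {
		fmt.Fprintf(&b, "Files: %d new, %d changed, %d unmodified\n",
			s.FilesNew, s.FilesChanged, s.FilesUnmodified)
		fmt.Fprintf(&b, "Dirs: %d new, %d changed, %d unmodified\n",
			s.DirsNew, s.DirsChanged, s.DirsUnmodified)
	}
	fmt.Fprintf(&b, "Added to the repo: %s\n", s.Added)
	if fromStdin {
		fmt.Fprintf(&b, "processed stdin, %s in %s\n", s.Processed, s.Duration)
	} else {
		fmt.Fprintf(&b, "processed %d files, %s in %s\n", s.FileCount, s.Processed, s.Duration)
	}
	fmt.Fprintf(&b, "snapshot %s saved\n", s.Snapshot)
	return b.String()
}

func main() {
	s := summary{
		FilesChanged: 1, FileCount: 1,
		Added: "305 B", Processed: "2.897 KiB", Duration: "0:00", Snapshot: "946f084a",
	}
	fmt.Print(render(s, true))
}
```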
Thoughts?
I think this is something that doesn't need fixing. It's pretty simple:
- Restic checks whether other (non-stdin) files have changed by looking at their timestamps, not at their contents.
- In this specific case, when you supply data to restic via stdin, you have not changed the data, but you have changed the timestamp, simply because the timestamp of the stdin "file" is not the same as the last time you backed it up.
- In other words, the ctime for the stdin "file" that restic is backing up has changed. Just like when the timestamp for a regular file changed.
- What restic then does is process the file, and flag that the file was changed. Like any other program using timestamps for change detection would.
If you look at this as if stdin were a regular file that had its timestamp changed, you might see it as more normal. If a regular file's timestamp was changed, but the contents were still the very same as the last time you backed it up, restic would do the very same thing as you are seeing here with stdin. There's no difference, except that the file happens to come in through stdin instead of residing on the filesystem.
If you want to change this, do you also want to change how restic treats regular files that have a changed timestamp but no changed contents? Probably not.
That's just my opinion. I personally wouldn't spend time "fixing" this. But it's not my decision, and I support whatever decision is made by the core devs 👍
There's a similar discussion on the topic of change detection at https://github.com/restic/restic/issues/2495 but it's a bit different in terms of issue/use case.
PS: I wish to note that we have yet to see a proper explanation of how this is an actual practical problem, i.e. what actual problem a "fix" would solve.
@rawtaz
But how is this an actual problem for you?
Every hour we create database backups and upload them to our S3 storage. To find out whether something changed, it is more convenient to analyze the logs and see at what time changes occurred than to compare every snapshot in the restic repository. I also imagine users' frustration when they see that something changed during a backup but restic diff shows no differences.
To me it is quite strange that restic checks ctime with the --stdin flag, as the expectation is that we operate on data (or data flows) between processes rather than on files. So the file attributes of the file descriptor should probably be ignored.
Also, if statistics are provided, they should not be confusing. They should be clear and precise and support further analysis, instead of showing that something changed when in fact it did not.
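For illustration, the kind of log analysis described above could be sketched as follows. The `parseFileStats` helper and the `fileStats` type are hypothetical; the line format is taken from the backup output quoted in this issue. Since the "changed" counter is reported as 1 on every --stdin run in this thread, a check like this cannot currently tell changed dumps from identical ones.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// filesLine matches the "Files:" line of the backup summary and tolerates
// extra padding between the fields.
var filesLine = regexp.MustCompile(`^Files:\s+(\d+) new,\s+(\d+) changed,\s+(\d+) unmodified`)

// fileStats mirrors the three counters of the "Files:" line.
type fileStats struct{ New, Changed, Unmodified int }

// parseFileStats extracts the counters from a line such as
// "Files: 0 new, 1 changed, 0 unmodified".
func parseFileStats(line string) (fileStats, error) {
	m := filesLine.FindStringSubmatch(line)
	if m == nil {
		return fileStats{}, fmt.Errorf("not a Files summary line: %q", line)
	}
	var n [3]int
	for i := range n {
		v, err := strconv.Atoi(m[i+1])
		if err != nil {
			return fileStats{}, err
		}
		n[i] = v
	}
	return fileStats{New: n[0], Changed: n[1], Unmodified: n[2]}, nil
}

func main() {
	st, err := parseFileStats("Files: 0 new, 1 changed, 0 unmodified")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", st)
}
```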
I don't think it's a major problem, but we should not print the stats for files/dirs when reading from stdin.
One more question about stdin: why does restic diff not show that the metadata for a snapshot changed? There is a U designation for the case where an item's metadata (access mode, timestamps, ...) was updated. But I see the following:
comparing snapshot 5a0bba25 to 44744c0a:
Files: 0 new, 0 removed, 0 changed
Dirs: 0 new, 0 removed
Others: 0 new, 0 removed
Data Blobs: 0 new, 0 removed
Tree Blobs: 0 new, 0 removed
Added: 0 B
Removed: 0 B
@rawtaz @fd0 any thoughts about it?
I think this issue raises very practical concerns with the stdin handling. As you pointed out, stdin is not a file but rather just "data".
I think right now the project is mainly focused on traditional files, and the --stdin feature was just quickly added to improve usability. But as adoption of restic grows, it'd be amazing to have the tools to handle data from multiple stdin streams and make sure they're a first-class citizen in the backup. In my eyes, the current issue of misleading output from restic in special stdin use cases is just a symptom of the underlying decisions that maybe could be done even a bit more cleanly.
@fd0 is there a way to get rid of all logic that involves some sort of "file creation time" in restic?