Omit snapshot creation if there was no change

Open jayme-github opened this issue 8 years ago • 55 comments

Feature request/discussion about implementing a switch that omits snapshot creation if there was no change in either metadata or data.

From IRC:

is there a way to omit snapshot creation if there was no change at all? (I have a large dataset that does not change very often, like once a month, but I would like restic to run at least once a day)

jayme: no, that's currently not possible with restic alone. every run of 'restic backup' will create a snapshot

jayme: but you can easily script that: use 'find' to find files that have been modified since the last backup, if there are any run 'restic backup', otherwise do nothing

fd0: thanks for your response. Do you think that feature is worth an issue? Or do you want that to stay out of restic?

jayme: it wouldn't take much code to add this... I'm not sure if it's worth it though

jayme: if you create an issue in the GitHub issue tracker, we can discuss it (and people can find it)

jayme: we'd need to talk about what a "change" is for you

jayme: only content? or metadata+content?

jayme: what about a file that has the same content as before, but was moved and has a new inode?

fd0: with "change" I meant "anything worth mentioning", e.g. move, metadata, content; just want to avoid creating "empty" snapshots as that's probably a waste of space/time
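
The scripting approach suggested in the log could look roughly like the following. This is a minimal sketch, not something posted in the thread: the helper names `changed_since` and `backup_if_changed` and the stamp-file convention are assumptions made for illustration, and it assumes the repository and password are already configured for `restic` through the environment. It uses Python instead of `find` so the check and the backup call live in one place.

```python
import os
import subprocess
import time


def changed_since(root, since):
    """Return True if root or anything below it has an mtime/ctime newer than `since`."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)  # lstat: do not follow symlinks
        if max(st.st_mtime, st.st_ctime) > since:
            return True
    return False


def backup_if_changed(root, stamp_file):
    """Run 'restic backup' only if something changed since the stamp file was last set.

    Hypothetical wrapper: the stamp file is an illustrative convention, not a restic feature.
    """
    last = os.path.getmtime(stamp_file) if os.path.exists(stamp_file) else 0.0
    if not changed_since(root, last):
        return False
    started = time.time()  # record the start so changes made during the backup are seen next time
    subprocess.run(["restic", "backup", root], check=True)
    with open(stamp_file, "a"):
        pass  # create the stamp file if it does not exist yet
    os.utime(stamp_file, (started, started))
    return True
```

Deleting or renaming a file updates the mtime of its parent directory, which is why directories are checked as well. Tools that reset timestamps after modifying a file would defeat this check.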

jayme-github avatar Nov 07 '16 15:11 jayme-github

Maybe it would be a good idea to still create a kind of alias name.

Crest avatar Nov 07 '16 15:11 Crest

We'll need to define what "no change" means: "No files were added/removed and no files have different content" and/or "no metadata and no content has changed".

fd0 avatar Nov 07 '16 17:11 fd0

As discussed on IRC: Not making a new snapshot may interfere with the forget policy...

fd0 avatar Nov 07 '16 18:11 fd0

I'm curious though: Why do you need this functionality? What's your use case?

fd0 avatar Nov 07 '16 18:11 fd0

I'm curious though: Why do you need this functionality? What's your use case?

I just felt that it is unnecessary to clutter up the repo with snapshots that aren't of any use to me. My use case is a set of files that I want to back up about twice a day, but they don't change often (once a month, maybe even less than that). That would leave me with ~59 "empty" snapshots a month in my repo, probably slowing down operations (as it is a remote repository with high round-trip times). I could run forget & prune regularly, but that would cost round trips as well as API calls etc.

All in all this is of course a "nice to have", as there are plenty of ways to work around this (or better: to correctly use restic :smiley:). I just wanted to bring it up as I thought about it, and so might others.

jayme-github avatar Nov 08 '16 07:11 jayme-github

Thanks for the explanation. What I had in mind when building restic and the repository structure was that a "snapshot" captures the state of the data at one point in time. If the data hasn't changed at all compared to any previous snapshot, an additional snapshot is very cheap and only uses a few hundred bytes and one additional file in the repository. You are right that more snapshots may slow operations down a bit, especially for high-latency remote backends, but I'm convinced this effect is negligible. If it isn't we can certainly optimize it (compare #523), but then I'd like to measure/benchmark first to get hard data :)

I'll close this issue for now, you can still add comments (and we can easily reopen it later).

fd0 avatar Nov 08 '16 08:11 fd0

Hi, first time restic user here, trying it out. It looks great so far. However, I was quite surprised that restic creates empty snapshots if nothing changed, and moreover that there is no flag to skip creation if there was no change. As a first time user, I expected this (i.e. skipping empty snapshots) to be the default behaviour, or at the very least to have an option for it. Creating empty snapshots is counterintuitive to me, I don't really see the purpose of them (again: first time user, this is my first instinctive reaction).

Reading the IRC chat log it seems it wouldn't be much effort to add this. Could this be added as a flag to backup, so users could at least have a choice?

ignus2 avatar Oct 10 '17 01:10 ignus2

Although I'd never use such an option myself I'd like to chime in on the question what 'no change' should mean.

@fd0 stated

"No files were added/removed and no files have different content" and/or "no metadata and no content has changed".

IMHO the only valid choice here would be the "AND" choice:

  • No nodes (dirs, files, symlinks, devices, special devices etc.) were added or removed, AND
  • No files have different content, AND
  • No metadata (permissions, owner, group, ctime, mtime etc.) changed

At first glance I thought there was some redundancy to this, as usually any content change would also induce a change of mtime. But on second thought there are always tools to set a ctime/mtime explicitly, so neither checking only the contents nor checking only the metadata is enough. I am not 100% sure about atime semantics, but by extrapolation I'd say the same thing should apply, so care must be taken to restore the atime after restic has read a file for checking its contents.

I believe there are some issues that ask for stats collection during a backup (e.g. #693, #874). I'd guess that the stats collection code needed for them would be useful here, too.
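
To make the "AND" definition above concrete, here is a minimal sketch that fingerprints a directory tree and compares two fingerprints. It is an illustration only and not how restic detects changes: the function names `fingerprint` and `has_changed` are hypothetical, only a subset of the metadata is recorded (mode, owner, group, mtime), and atime is deliberately left out because reading a file to hash it may update it.

```python
import hashlib
import os
import stat


def fingerprint(root):
    """Map each relative path under root to (mode, uid, gid, mtime_ns, content)."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # st_mode also encodes the node type
            content = None
            if stat.S_ISREG(st.st_mode):
                with open(path, "rb") as fh:
                    content = hashlib.sha256(fh.read()).hexdigest()
            elif stat.S_ISLNK(st.st_mode):
                content = os.readlink(path)  # for symlinks the target is the content
            result[os.path.relpath(path, root)] = (
                st.st_mode, st.st_uid, st.st_gid, st.st_mtime_ns, content)
    return result


def has_changed(before, after):
    """True if nodes were added/removed, or any content or recorded metadata differs."""
    return before != after
```

Because both the content digest and the metadata are part of each entry, a file whose content was altered and whose mtime was then reset still counts as changed, and so does a pure `chmod`. Further fields such as ctime or extended attributes could be added to the tuple in the same way.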

fawick avatar Oct 10 '17 10:10 fawick

@ignus2 thank you for describing your expectations and your reaction, that's very valuable for us as a project!

Restic snapshots can be compared more to "virtual machine snapshots" or "lvm/zfs file system snapshots" than e.g. a tar file of what has changed. If nothing has changed, a snapshot is still created to record "this was the current state" at a particular point in time. Maybe we should add that to the manual.

fd0 avatar Oct 11 '17 05:10 fd0

So would it be possible to add the flag to skip creating a snapshot if nothing changed?

ignus2 avatar Oct 11 '17 08:10 ignus2

It would be possible to add this, but I don't think we'll add it: It's just not the way restic works, and it will cause problems when you use the forget command.

fd0 avatar Oct 11 '17 12:10 fd0

It wouldn't change the way restic works by default, as it would be an optional flag. What kind of problems would it cause with the forget command btw?

ignus2 avatar Oct 11 '17 13:10 ignus2

Optional code paths need to be tested and maintained too! So, it creates ongoing technical debt for a feature that is a bit unusual.

I don't recall: did you explain why having "empty" snapshots is such a problem for your use case?

mlbarrow avatar Oct 11 '17 14:10 mlbarrow

The idea behind the forget command (as explained e.g. in this blog entry) is that you specify a policy for snapshots that you'd like to retain. If you only have snapshots for when data has changed, specifying e.g. --keep-daily does not make sense any more.

There's really no such thing as an "empty" snapshot in restic. Each snapshot captures the data and metadata at a given point in time and is independent (concerning the data structures) from all the other snapshots.

Btw, if you really like to do that, you could use restic snapshots --json, then take the snapshot IDs, use restic cat snapshot <id> for each and drop the ones where the tree IDs haven't changed. That'd amount to roughly removing "empty" snapshots.
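
The workaround described above can be sketched as follows. This is a hedged illustration, not an official tool: it assumes that each entry in the `restic snapshots --json` output carries `id`, `time`, and `tree` fields (so the separate `restic cat snapshot <id>` step is not needed here), that the `time` strings sort chronologically, and that all snapshots belong to the same host and path set. The function name `redundant_snapshots` is hypothetical.

```python
import json


def redundant_snapshots(snapshots):
    """Return IDs of snapshots whose tree ID equals that of the preceding snapshot."""
    ordered = sorted(snapshots, key=lambda s: s["time"])
    redundant = []
    previous_tree = None
    for snap in ordered:
        if snap["tree"] == previous_tree:
            redundant.append(snap["id"])  # same tree as the snapshot before it
        else:
            previous_tree = snap["tree"]
    return redundant


# Hand-written sample data standing in for real 'restic snapshots --json' output.
sample = json.loads("""[
  {"id": "a1", "time": "2017-10-09T02:00:00Z", "tree": "t1"},
  {"id": "b2", "time": "2017-10-10T02:00:00Z", "tree": "t1"},
  {"id": "c3", "time": "2017-10-11T02:00:00Z", "tree": "t2"},
  {"id": "d4", "time": "2017-10-12T02:00:00Z", "tree": "t1"}
]""")
print(redundant_snapshots(sample))  # ['b2']
```

Snapshot `d4` is kept even though its tree matches `a1`, because it differs from its immediate predecessor `c3`. The returned IDs could then be passed to `restic forget`, followed by `restic prune`.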

fd0 avatar Oct 11 '17 14:10 fd0

Yes I know they're not empty. I just used that term to more closely describe them per this particular scenario; "empty" implying little to no value to the OP based on the criteria of no files changing.

mlbarrow avatar Oct 11 '17 14:10 mlbarrow

I'm going to reopen this issue.

fd0 avatar Oct 11 '17 15:10 fd0

What is "unusual" is a matter of opinion I believe, for me having "empty" (by the definition of "having little to no value based on the criteria of no files and their metadata changing") snapshots is unusual.

Regarding the forget policy, I don't see how it would interfere. For example running restic backup occasionally (possibly depending on other means to determine whether something changed and a backup needs to be made or not) would have the same effect as skipping creating a snapshot, in which case forget also wouldn't make sense as you write.

I'd like to emphasize again, that it would be an optional feature for those who would like to use restic in a slightly different manner, who perhaps would never use the keep-daily etc forget features at all.

Thanks for mentioning a workaround btw.

ignus2 avatar Oct 11 '17 15:10 ignus2

I'm curious: Does other backup software typically have such an option?

fd0 avatar Oct 11 '17 15:10 fd0

I'm curious: Does other backup software typically have such an option?

I don't know, but if so, then restic should also, if not, restic could be unique in this regard ;)

BTW, I think I understand the resistance to this feature, as restic puts emphasis on the "when" or "time" of the backup (also indicated by the way forget works, centered around time), hence a snapshot in time. The use case I have in mind (and maybe the OP's too) puts more emphasis on the "changes", with the "when" as additional side information.

EDIT: Something like git. Or is thinking about restic the way git works (from the end user's perspective) a totally bad idea?

ignus2 avatar Oct 11 '17 15:10 ignus2

IMHO, this kind of "cleverness" has no place in backup software. If I ask backup software to do the backup, I'd like it to not play games behind my back, have opinions of its own, decide that nothing's changed etc... So, later tomorrow I scratch my head looking for last night's backup, which isn't there?!

Let's keep things simple: if there's nothing new to back up, well.. then don't back up! Decide outside the backup software, then dispatch the backup or not. Too clever backup software would be unreliable backup software. I'd like it to be reliable, not too clever, if at all possible.

zcalusic avatar Oct 11 '17 15:10 zcalusic

I have never seen this, which is why I used "unusual" to describe this feature in my previous message.

The other reason why it's unusual is that it's kind of against the concept of a data protection solution. The snapshot represents the state of the world at that time (refer back to the explanation about the fact that snapshots aren't really empty).

If a snapshot were missing, it would be hard to tell whether the backup actually ran and the snapshot was deemed unneeded by this criterion, or whether the system failed.

mlbarrow avatar Oct 11 '17 16:10 mlbarrow

@zcalusic This "cleverness" would be hidden behind an optional command line switch, so only those users would be affected by it who specifically ask for it.

ignus2 avatar Oct 11 '17 16:10 ignus2

@ignus2: I get it, but any code added to the code base needs to be tested and maintained over time. Plus, it appears that your request is very unique.

Another way to achieve your end state would be to increase the RPO (Recovery Point Objective). In other words, if you know that your data changes on a less frequent basis, don't bother creating more frequent snapshots.

mlbarrow avatar Oct 11 '17 20:10 mlbarrow

I second @ignus2’s comment about the optional switch. There are obviously quite different use cases of restic.

Some want to do "traditional" backups, let's say daily, let's say for a lot of files, and be able to restore the state of day X if something bad happened on X+1. They know the time when the data was destroyed. Those users focus on the restoration of state at a given date. They never want to miss a snapshot.

Others (like me) want to use restic to snapshot the state of a folder quite often, maybe every 30min. The state at a given date is not important then, maybe they don't know when data was corrupted (e.g. file sync). Instead they want to quickly see if/when there were changes (e.g. to be able to track down the point where data corruption took place with as few diffs as possible). Those users focus on the points in time where changes took place. Snapshots without changes are just annoyances for them (e.g. when mounting the backup to do diffs between every two adjacent snapshots).

maxhq avatar Oct 11 '17 22:10 maxhq

What you describe sounds like you're attempting to use restic as a revision control system of sorts, no?

mlbarrow avatar Oct 12 '17 15:10 mlbarrow

I think it is irrelevant what we call it. The use case is clearly defined above and is a legitimate one. restic is more than capable of supporting it even now, but a manual workaround has to be employed (list snapshots, compare each tree ID with the previous one, delete if they are the same, prune). A built-in optional switch would support this use case directly and make it much more convenient.

ignus2 avatar Oct 12 '17 15:10 ignus2

Well -- you have the source. Knock yourself out!

mlbarrow avatar Oct 12 '17 16:10 mlbarrow

In that case, could you roughly outline what modifications would be needed (as an overview) and what to watch out for when implementing this feature? Thanks in advance.

ignus2 avatar Oct 12 '17 16:10 ignus2

Please don't work on this feature for now, I'd like to rework the archiver code first. Thanks.

I'd like to try again to explain my reservations regarding the forget command, I think that wasn't clear enough.

Suppose we have a user who runs restic backup every 30 mins automatically (e.g. via cron). On the next day, they run restic forget --keep-hourly 24.

Right now, the repository will then contain 24 snapshots, one for each hour of the previous day.

With this feature turned on (keeping only the snapshots in which data is changed), the results may vary wildly: The repository will still contain up to 24 snapshots, at most one per hour, but there may be a snapshot from last week, followed by one from two days before. Are you aware that forget works this way?
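
The effect described above can be illustrated with a simplified model of an hourly retention policy. This is a sketch under stated assumptions and not restic's actual implementation: `keep_hourly` is a hypothetical function that keeps the newest snapshot from each of the `n` most recent hours that contain at least one snapshot.

```python
from datetime import datetime, timedelta


def keep_hourly(times, n):
    """Simplified model: newest snapshot of each of the n most recent non-empty hours."""
    kept = []
    seen_hours = set()
    for t in sorted(times, reverse=True):  # newest first
        hour = t.replace(minute=0, second=0, microsecond=0)
        if hour in seen_hours:
            continue  # this hour already has its (newer) snapshot
        if len(seen_hours) == n:
            break  # n hours are filled, everything older is dropped
        seen_hours.add(hour)
        kept.append(t)
    return kept


def span(kept):
    """Time between the oldest and newest retained snapshot."""
    return max(kept) - min(kept)


end = datetime(2017, 10, 12, 0, 0)
# A snapshot every 30 minutes for three days (current behaviour).
dense = [end - timedelta(minutes=30 * i) for i in range(1, 145)]
# Snapshots only when data changed, here assumed to be once every 8 hours.
sparse = [end - timedelta(hours=8 * i) for i in range(1, 31)]

print(span(keep_hourly(dense, 24)))   # 23:00:00
print(span(keep_hourly(sparse, 24)))  # 7 days, 16:00:00
```

Both runs retain 24 snapshots, but with snapshots created only on change, the same policy reaches back more than a week instead of one day.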

What bothers me is that forget becomes pretty unpredictable.

So, for now we leave this issue open.

People reading this issue: If you have a use case that's not already described (i.e. using restic as a kind of "version control" system), please add a comment.

Otherwise, please don't add further comments for now.

Thanks!

fd0 avatar Oct 12 '17 17:10 fd0

OK, though I already implemented a basic version as a first try and it works nicely (7 lines of code practically).

Regarding your comments about the forget command: it is not relevant here. I think it wasn't clear from @maxhq's explanation: this mode of operation or flag (i.e. skipping snapshots on no change) is not meant to be used, and makes no sense, in a use case that involves the forget command. In the same way, the forget command makes no sense in a use case where skipping snapshots is involved.

To be more clear: users who wish to and need to use the forget command will not and should not use restic backup with the flag that skips snapshots on no change, and vice-versa.

ignus2 avatar Oct 12 '17 18:10 ignus2