Remove or replace archive size column with different metric
Vorta's Archives tab shows a list of archives with their sizes.
I guess the size shown is the deduplicated size (dcsize) of the respective archive, which is always relative to the overall repo state.
So, if one starts from:
2020-06-11 23:00 24MB ... archive2
2020-06-11 22:00 50GB ... archive1
Now one decides to delete archive1, after that it will show:
2020-06-11 23:00 24MB ... archive2
This is not correct; it just shows the value that was determined previously.
But due to the overlap of archive1 and archive2, most of the volume of the deleted archive1 should now show up for archive2.
The fix is to run borg info repo::archive for every archive after deleting or pruning to refresh the size values. The same applies when adding archives, since this can also change the size value. I guess one should do that whenever something has changed within the manifest, so one could even catch changes not made via vorta.
The "refresh" button does not do that, so clicking it doesn't fix the problem.
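A minimal Python sketch of that refresh step, assuming `borg info --json` prints an `archives` list whose entries carry `name` and `stats.deduplicated_size`. The helper names (`info_command`, `parse_dedup_size`, `refresh_sizes`) are hypothetical and not part of Vorta; the `run` parameter exists only so the subprocess call can be swapped out.

```python
import json
import subprocess


def info_command(repo, archive_name):
    """Build the `borg info` call for a single archive, asking for JSON output."""
    return ["borg", "info", "--json", f"{repo}::{archive_name}"]


def parse_dedup_size(info_json):
    """Return (name, deduplicated_size) from one `borg info --json` result.

    Assumption: the output looks like {"archives": [{"name": ..., "stats": {...}}]}.
    """
    archive = json.loads(info_json)["archives"][0]
    return archive["name"], archive["stats"]["deduplicated_size"]


def run_borg(cmd):
    """Run a borg command and return its stdout as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def refresh_sizes(repo, archive_names, run=run_borg):
    """Re-query every archive, because deleting one can change all the others."""
    return dict(
        parse_dedup_size(run(info_command(repo, name))) for name in archive_names
    )
```

This issues one `borg info` call per archive, which is exactly the cost discussed in the rest of the thread.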
We discussed this before in https://github.com/borgbase/vorta/issues/341 and someone suggested to run borg info {repo} --last 99999 --json.
But I found this takes a long time to run, so it's not practical for us. Users would wonder what's going on.
Currently we run borg list $REPO after most operations to sync the list of archives. This is fast, but doesn't give the sizes or many details.
> But I found this takes a long time to run, so it's not practical for us.
What is the concrete time difference here?
> Users would wonder what's going on.
Would it not be possible to simply explain this to users?
side note: from borg 1.2 changelog:
include size/csize/nfiles[_parts] stats into archive, https://github.com/borgbackup/borg/issues/3241
so, in future we can get some size information (from newly created archives) rather quickly, but of course not the deduplicated size which is expensive to compute and relative to all other repo contents (and thus bad to cache).
so, how about just showing the size and the csize of an archive? even now, they are not relative to all other repo content, so can be cached.
the dcsize of all archives could be shown as long as the manifest did not change, then dcsize should not be shown any more until it is correctly and expensively recomputed. if we decide this is too expensive to do, then the consequence must be not to show it (there is no point in showing information that is frequently incorrect).
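One way to picture that policy is the Python sketch below. It is only an illustration: `ArchiveSizeCache` and `manifest_token` are hypothetical names, and the token stands for whatever value a frontend can use to detect a manifest change (an assumption, not an existing Vorta or borg API).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ArchiveSizeCache:
    """Caches per-archive sizes; dcsize is only trusted for one manifest state."""

    manifest_token: Optional[str] = None
    sizes: Dict[str, Tuple[int, int]] = field(default_factory=dict)  # name -> (size, csize)
    dcsizes: Dict[str, int] = field(default_factory=dict)  # name -> dcsize

    def store_sizes(self, name: str, size: int, csize: int) -> None:
        # size and csize do not depend on other archives, so they never expire.
        self.sizes[name] = (size, csize)

    def store_dcsizes(self, manifest_token: str, dcsizes: Dict[str, int]) -> None:
        # Replace all dcsize values at once; they are only valid together.
        self.manifest_token = manifest_token
        self.dcsizes = dict(dcsizes)

    def observe_manifest(self, manifest_token: str) -> None:
        # Any manifest change (even one made outside the GUI) drops dcsize.
        if manifest_token != self.manifest_token:
            self.manifest_token = manifest_token
            self.dcsizes.clear()

    def dcsize(self, name: str) -> Optional[int]:
        # None means "unknown, do not display".
        return self.dcsizes.get(name)
```

With this shape, size and csize stay visible at all times, while the dcsize column goes blank after any manifest change until someone pays for the recomputation.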
> so, how about just showing the size and the csize of an archive? even now, they are not relative to all other repo content, so can be cached.
Will try this out with 1.2 and then add a feature flag for users of that version.
I just ran into this problem (I have been using Vorta and borg for less than a month). If I understand it correctly, recomputing sizes for all archives would be too slow. However, wouldn't it be enough to recompute only the archives that immediately follow the deleted ones?
In my example I have:
(... and so on...)
archive4 10,1 MB
archive3 78,4 MB
archive2 12,4 MB
archive1 44,5 GB
and I deleted archive1. Now I just see:
(... and so on...)
archive4 10,1 MB
archive3 78,4 MB
archive2 12,4 MB
and I've completely lost information about the initial size of 44,5 GB of the first backup, which is quite informative about the actual space occupied on the backup device (excluding the increments, of course). Isn't archive2 the only one for which recomputation is needed?
Today I made some tests for this. Running "borg info -P
- the recomputed "deduplicated size" is different not only for the first non-deleted archive, but for almost all of them
- it still shows me a deduplicated size of 75.97 MB for the first archive, which is nowhere near the ~45 GB I would have expected; this made me understand what ThomasWaldmann was saying, namely that the deduplicated size is meant to be repository-wide and is not computed in a linear way (from the first backup on); this also explains the previous point
A deduplicated size computed this way is very interesting information, of course. However, it highlights that the "size" column in Vorta's Archives tab is probably pretty much useless unless it gets updated on every archive deletion or creation, which can be costly for huge repositories. The original and compressed sizes would probably be more useful indications for the archives: they would not require an update and would help users get an idea of how big their backup set is. A deduplicated size for just the whole repository would be another useful piece of information to show; it is very quick to update after each action and would help users understand how much total space is being used on the backup repository. Then a "Get archive info" button, perhaps with a dialog, could let users retrieve detailed information on a single archive if they want to learn more about it.
This is just my 2 cents.
Thanks for the thorough test. Definitely makes a case to remove or change this column.
Maybe replace the size with compression ratio?
I don't know whether this is related or not: if not, please advise, I can open a new bug report. I see now that there's a repository-wide size indication in the "Repository" tab. However, I can't understand if the values I see there are right, if they are also cached, etc. I see (translating from Italian):
Original size: 27.1 TB
Deduplicated size: 192.9 GB
Compressed size: 69.0 GB
First of all, I suspect that the deduplicated and compressed sizes are swapped: I would have expected the deduplicated size to be smaller than the compressed size. Maybe an Italian localization problem? Also, if I issue a borg info command, this is what I get:
Original size: 24.75 TB
Compressed size: 12.50 TB
Deduplicated size: 67.14 GB
which are quite different values. Maybe again a cache problem?
I would agree that the "deduplicated" size per archive is almost useless. If you perform a first backup (archive1) and get a reported deduplicated size, then perform another backup (archive2) with no changes, the deduplicated size for archive2 is practically 0. BUT IT IS NOW ALSO 0 for archive1, because neither archive1 nor archive2 contains any unique data, so both report a deduplicated size of about 0. If you delete the second archive (archive2), the deduplicated size of the earlier archive (archive1) becomes non-zero again. So recomputing only the later archives' sizes after deleting an archive does not correctly update the deduplicated sizes: you must recompute them for all archives if you expect them to be correct.
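The effect can be reproduced with a toy model in Python. This is a simplified sketch, not borg's implementation: each archive is treated as a set of chunk ids, and the deduplicated size of an archive is taken to be the total size of the chunks that no other archive references. The chunk ids and sizes are made up for illustration.

```python
from collections import Counter


def deduplicated_sizes(archives, chunk_sizes):
    """Return name -> size of the chunks referenced by that archive only.

    archives: name -> set of chunk ids; chunk_sizes: chunk id -> size.
    """
    # Count how many archives reference each chunk.
    refs = Counter(chunk for chunks in archives.values() for chunk in chunks)
    return {
        name: sum(chunk_sizes[c] for c in chunks if refs[c] == 1)
        for name, chunks in archives.items()
    }


chunk_sizes = {"a": 40, "b": 10}  # hypothetical sizes in arbitrary units

# Two backups of unchanged data share every chunk.
archives = {"archive1": {"a", "b"}, "archive2": {"a", "b"}}
print(deduplicated_sizes(archives, chunk_sizes))  # {'archive1': 0, 'archive2': 0}

# Deleting archive2 makes archive1's chunks unique again.
del archives["archive2"]
print(deduplicated_sizes(archives, chunk_sizes))  # {'archive1': 50}
```

Because the reference counts are taken over the whole repository, adding or removing any archive can change the value for every other archive, earlier or later.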