
Reported original size regression?

Open MichaelDietzel opened this issue 6 months ago • 3 comments

I have an old borg repository and want to recreate it using different chunker params. So I mount the repository and back up the mounted archives again to a new repository. I want to make sure that nothing gets lost in this process, so I first checked whether the original sizes of the backups match. To my surprise, the new backup reports a bigger original size.

My old archive was created on 2021-11-22 (I do not know how to find the version I used) and borg info reports an original size of 967.53 GB. The new archive reports 967.99 GB when creating it.

So I tried several things to find out what the difference was, but everything appears to be the same. When I run borg list repo::archive --format="{size}{NEWLINE}" and add up all the reported sizes, I get exactly 967 532 246 644 bytes, both for the old and the new repo/archive. This nicely matches the size reported by borg info for the old archive, so it appears to me that the "Original size" reported by new versions of borg is too high. I also tried some older versions of borg to find out which version causes this, but all versions I tried (1.1.18, 1.2.2, 1.2.8 and 1.4.0) had the same issue, although not all reported the same "Original size". Sadly, I was not able to run versions older than 1.1.18 (which is from 2022-06-05 and thus still newer than my archive) due to multiple error messages. Maybe I could switch to an older OS to be able to run an even older version.
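The summation described above can be sketched roughly as follows, assuming the output of borg list repo::archive --format="{size}{NEWLINE}" has been captured as text with one integer per line. The function name sum_sizes and the sample input are made up for illustration; they are not part of borg.

```python
def sum_sizes(listing: str) -> int:
    """Add up one integer size per line, skipping blank lines."""
    return sum(int(line) for line in listing.splitlines() if line.strip())


# Hypothetical sample standing in for the captured `borg list` output.
sample = "1024\n0\n2048\n"
print(sum_sizes(sample))
```

Running the same sum over the listing of the old and the new archive and comparing the two totals is the check described here.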

Some further information:

  • The original backup was taken from NTFS. I am not sure if I took it over SMB or mounted it on Linux.
  • The backup contains hardlinks, but apparently borg counts them more than once; otherwise the result of borg list should not match.
  • I probably took the original backup using Ubuntu 20.04.
  • I am running borg in a Debian bookworm LXC container under Proxmox.

So do you have any idea what could be going on? Is there anything I could be doing wrong?

Thank you!

MichaelDietzel avatar Jun 03 '25 20:06 MichaelDietzel

My suspicion is that this might be related to checkpointing and .part files.

But as a general comment: be careful with just backing up a mounted borg archive again, borg mount does not support all metadata (e.g. no ACLs). The simple stuff like filenames, mtime and content will work though.

ThomasWaldmann avatar Jun 03 '25 23:06 ThomasWaldmann

Thank you for your quick reply!

I think it should not be related to checkpoints, as I basically disabled them by using a large interval to avoid such cases. I should have included my commands so that you could see that I used a checkpoint interval of 360000 seconds (which should be 100 hours).

I did some further testing and now I think I know a little more. I started to suspect that the size of the metadata is included in the original size, and that turned out to be at least partially true. I changed the code of borg 1.4.1 to report the exact sizes instead of the "human readable" numbers so I could see more easily what is going on. For that I just modified the function sizeof_fmt in parseformat.py.
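To illustrate why the rounded numbers hide small differences, here is a minimal sketch of a human-readable formatter next to an exact one. This is not borg's actual sizeof_fmt; the names human_size and exact_size, the decimal unit list, and the second byte count (the first plus 1000) are assumptions chosen only for the example.

```python
def human_size(num_bytes: int, precision: int = 2) -> str:
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units[:-1]:
        if abs(value) < 1000:
            return f"{value:.{precision}f} {unit}"
        value /= 1000
    return f"{value:.{precision}f} {units[-1]}"


def exact_size(num_bytes: int) -> str:
    """Format a byte count without any rounding."""
    return f"{num_bytes} B"


size_a = 967532246644          # total from the borg list sum above
size_b = size_a + 1000         # hypothetical value, slightly larger

print(human_size(size_a), "|", human_size(size_b))
print(exact_size(size_a), "|", exact_size(size_b))
```

With two decimal places at the GB scale, both values render the same way, so only the exact output shows the difference.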

For my first tests I used a "small" file of 2 198 863 872 bytes (~2.2 GB) in a freshly created empty repo. This is the output of borg create:

borg create --stats --progress --chunker-params 15,23,17,4095 --compression zstd,10 --files-cache=disabled --checkpoint-interval 360000 repo::test testfile
------------------------------------------------------------------------------
Repository: repo
Archive name: test
Archive fingerprint: 446ed6fb0d3eaca676cd652511d65b195fee971653a6429565ef6134e306a80b
Time (start): Thu, 2025-06-05 08:47:13
Time (end):   Thu, 2025-06-05 08:47:27
Duration: 14.06 seconds
Number of files: 1
Utilization of max. archive size: 0%
------------------------------------------------------------------------------
                    Original size      Compressed size    Deduplicated size
This archive:           2198864700 B          962330919 B          955690447 B
All archives:           2198863872 B          962330172 B          955952148 B

                    Unique chunks         Total chunks
Chunk index:                    6452                 6641
------------------------------------------------------------------------------

This is the output of borg info:

borg info repo::test
Archive name: test
Archive fingerprint: 446ed6fb0d3eaca676cd652511d65b195fee971653a6429565ef6134e306a80b
Comment: 
Hostname: ...
Username: michael
Time (start): Thu, 2025-06-05 08:47:13
Time (end): Thu, 2025-06-05 08:47:27
Duration: 14.06 seconds
Number of files: 1
Command line: /home/michael/borg_dev/borg-env_1.4.1_custom/bin/borg create --stats --progress --chunker-params 15,23,17,4095 --compression zstd,10 --files-cache=disabled --checkpoint-interval 360000 repo::test testfile
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                    Original size      Compressed size    Deduplicated size
This archive:           2198863872 B          962330172 B          955952148 B
All archives:           2198863872 B          962330172 B          955952148 B

                    Unique chunks         Total chunks
Chunk index:                    6452                 6641

Here we can see multiple things:

  • The Original size and Compressed size that borg create reported for This archive are slightly larger than the ones reported for All archives, while the Deduplicated size is smaller
  • The Original size for All archives is the correct file size
  • borg info reports different sizes than borg create
  • The sizes for borg info are correct
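The differences can be made explicit with a bit of arithmetic on the "This archive" rows from the two outputs above. This is only a sketch for comparing the reported numbers; the variable names are made up, and it says nothing about where the extra bytes come from.

```python
# "This archive" row as reported by borg create (copied from above).
create_this = {
    "original": 2198864700,
    "compressed": 962330919,
    "deduplicated": 955690447,
}

# "This archive" row as reported by borg info (copied from above).
# In this run it is identical to the "All archives" row of borg create.
info_this = {
    "original": 2198863872,
    "compressed": 962330172,
    "deduplicated": 955952148,
}

# Positive delta: borg create reported more than borg info.
deltas = {name: create_this[name] - info_this[name] for name in create_this}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d} B")
```

For this test file, borg create reports 828 B more Original size and 747 B more Compressed size than borg info, but 261701 B less Deduplicated size.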

My next step was to modify the source so that the size of the metadata is no longer counted in the statistics. Then I deleted the repository, restored a copy and re-ran my tests. The results:

  • All the values for Original size matched the file size.
  • Compressed size was reported to be identical in all cases, and it matched the size reported by borg info from above.
  • I still got two different values for Deduplicated size when running borg create. The values of borg info both matched the value for All archives of borg create. But they didn't match the values from above, which surprised me.

Lastly, I also re-ran the tests for my real archive using an unmodified version of borg 1.4.1, but there all of the low-precision numbers reported by create and info matched each other. So apparently in that case the size of the metadata is (mostly) included in the reported numbers, and I still don't know why the numbers for my old archive seem to be reported without metadata. Next, I plan to re-run the tests on the real archive using my modified version of borg that shows the high-precision numbers, and then compare the numbers with and without metadata again. But this will take a while.

So it appears to me that there are some (slight) inconsistencies in how the sizes are reported. What is the intended size to be reported? Personally, I would prefer it to be just the size of the files without the metadata for This archive, but I have several other thoughts on that:

  • maybe others prefer the size of the metadata to be included
  • I only implemented the metadata exclusion for create, it could be harder to do for info
    • I do not really trust my implementation, as my understanding of borg's code is still limited. At least it broke no tests. So the tests do not seem to really care about the exact sizes in the statistics. Maybe there should be some tests for that?
  • It could be even harder to implement for the sizes of the whole archive
  • Maybe a compromise would be reporting both types (with and without metadata) for "This archive"?
  • Maybe the behavior of the statistics could be documented (I didn't find much about it in the documentation. Did I overlook it?)

Are these inconsistencies interesting to you and is there a chance they could be fixed, or is there no point in me digging further into this?

MichaelDietzel avatar Jun 05 '25 07:06 MichaelDietzel

Well, I guess it is a matter of taste/definition whether the metadata size should be included in the computation.

But I think the numbers shown should be consistent at least or they will raise eyebrows.

So if we can do some small changes to get them consistent, that would be nice.

Please note that this mostly affects borg 1.x. In the master branch (borg2), I got rid of most of the statistics, because some of the stuff can't be easily computed anymore and a lot of stats stuff was simply "in the way" while doing the big changes for borg2.

ThomasWaldmann avatar Jun 05 '25 08:06 ThomasWaldmann

@MichaelDietzel can you review #9003?

ThomasWaldmann avatar Sep 06 '25 21:09 ThomasWaldmann

@MichaelDietzel can you test or review that PR?

ThomasWaldmann avatar Sep 21 '25 15:09 ThomasWaldmann

Sorry for not replying earlier. I have been really busy lately and was not at home the last few weekends, so I have not managed to test the PR yet. I should finally be able to do it in the next few days. Thanks for working on this. I spent a few hours myself trying to fix this (mostly trying to understand exactly what is being counted in the stats), but there are still a few parts that I do not understand, so I am looking forward to seeing what you did!

MichaelDietzel avatar Sep 24 '25 21:09 MichaelDietzel

Fixed in 1.4-maint by #9003.

The master branch computes far fewer stats and does so very differently, so this does not apply there.

ThomasWaldmann avatar Oct 27 '25 23:10 ThomasWaldmann