the new (c)size ticket

Open ThomasWaldmann opened this issue 9 years ago • 9 comments

scope of this ticket

let's concentrate here on the issue of csize (and also size) information in the items' chunks lists, in the chunks and files cache. no crypto or other discussion in here, let's stay focussed.

csize

the main issue is that csize is not a direct function of the data, it also depends on compression and encryption (and other overhead) that is applied to the data. as both might change (and thus csize might change) while the chunk still contains the same (plaintext) content and has the same id, it is an annoyance to have csize in the chunks lists of archived items.
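A minimal sketch of why csize is not a function of the data alone. Plain sha256 stands in for borg's keyed id hash and zlib for whatever compression is configured; both are assumptions made for illustration, and encryption overhead is left out entirely.

```python
import hashlib
import zlib

data = b"some plaintext chunk content " * 1000


def chunk_id(plaintext):
    # stand-in for the real (keyed) id hash: depends on the plaintext only
    return hashlib.sha256(plaintext).digest()


def stored_size(plaintext, level):
    # stand-in for csize: depends on how the chunk is stored, here the zlib level
    return len(zlib.compress(plaintext, level))


if __name__ == "__main__":
    print("id   :", chunk_id(data).hex()[:16], "(same for every level)")
    print("size :", len(data))
    for level in (0, 1, 9):
        print("csize at level", level, ":", stored_size(data, level))
```

The id and size stay the same for all levels, while csize changes with the level, which is exactly what makes it awkward to keep in the chunks lists.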

size

we must have chunk size information in the chunks lists of archived items for the case we lose multiple chunks in the repo - so we can replace them with all-zero chunks of same length. size is a direct function of the data, so no problem here if we change compression/encryption/overhead.
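To illustrate why size has to stay in the chunks lists, here is a rough sketch of reading an item when some chunks are gone. The function name read_item and the dict used as a repository are hypothetical stand-ins, not borg's actual code.

```python
def read_item(chunks, repo):
    """Yield the content of an item chunk by chunk.

    chunks: list of (id, size) tuples from the archived item
    repo:   dict-like mapping id -> plaintext bytes (hypothetical stand-in)
    """
    for chunk_id, size in chunks:
        data = repo.get(chunk_id)
        if data is None:
            # chunk is lost: substitute an all-zero chunk of the same length,
            # which is only possible because size is stored with the reference
            data = bytes(size)
        yield data


if __name__ == "__main__":
    repo = {b"id-a": b"hello"}
    chunks = [(b"id-a", 5), (b"id-lost", 3)]
    print(b"".join(read_item(chunks, repo)))
```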

timing of size / csize computation

  • size is computed early, after the chunker has cut the chunks: len(chunk)
  • csize is computed late, after compression, after encryption/authentication. note: this can lead to a race (wait) condition in multithreaded processing.

where is chunk size/csize (not) stored?

  • repo: the current PUT entry in the segment file contains csize in the length information. no size available here! also, neither size nor csize is in the repo index.
  • archive: item.chunks = [(id, size, csize), ...]
  • chunks cache: id -> (refcount, size, csize)
  • files cache: no size/csize here: path_hash -> (file_size, ino, mtime, chunks=[id, id, ...])
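The structures listed above can be sketched roughly like this. The namedtuple names are made up for illustration, and chunk_incref here is a simplified stand-in for the real method: it only shows that size/csize for a reference come from the chunks cache, not from the files cache.

```python
from collections import namedtuple

# hypothetical names, mirroring the tuples listed above
ChunkListEntry = namedtuple("ChunkListEntry", "id size csize")         # archive: item.chunks
ChunkIndexEntry = namedtuple("ChunkIndexEntry", "refcount size csize")  # chunks cache value
FilesCacheEntry = namedtuple("FilesCacheEntry", "file_size ino mtime chunks")  # chunks = ids only


def chunk_incref(chunks_cache, chunk_id):
    """Bump the refcount and return a chunk list entry with size/csize from the cache."""
    entry = chunks_cache[chunk_id]
    chunks_cache[chunk_id] = entry._replace(refcount=entry.refcount + 1)
    return ChunkListEntry(chunk_id, entry.size, entry.csize)


if __name__ == "__main__":
    chunks_cache = {b"id-a": ChunkIndexEntry(refcount=1, size=4096, csize=1200)}
    files_cache = {b"path-hash": FilesCacheEntry(4096, 42, 1490878980, [b"id-a"])}
    # an unchanged file: ids come from the files cache, sizes from the chunks cache
    item_chunks = [chunk_incref(chunks_cache, cid) for cid in files_cache[b"path-hash"].chunks]
    print(item_chunks, chunks_cache[b"id-a"].refcount)
```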

where is size/csize used?

  • size dsize csize dcsize placeholders
  • Statistics class + show_progress
  • chunk_incref (gets size/csize from chunks cache - important for archiving unchanged files)
  • csize: Archive.info -> limits -> max_archive_size and Archive.__str__
  • csize: Cache.__str__ .chunks_stored_size
  • size: do_diff sum_chunk_sizes (to show sum of lengths of added/removed chunks of a file)
  • size: borg check size consistency check item.size == sum(chunks size)
  • tests
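The two size users from the list (do_diff's sum_chunk_sizes and the borg check consistency check) only need the size field. A simplified sketch, with check_item_size as a hypothetical helper name and chunks as (id, size, csize) tuples:

```python
def sum_chunk_sizes(chunks):
    """Sum of the plaintext lengths of the referenced chunks; csize is ignored."""
    return sum(size for _id, size, _csize in chunks)


def check_item_size(item_size, chunks):
    """Consistency check: item.size == sum(chunks size)."""
    return item_size == sum_chunk_sizes(chunks)


if __name__ == "__main__":
    chunks = [(b"id-a", 4096, 1200), (b"id-b", 1000, 0)]  # csize=0 does not matter here
    print(sum_chunk_sizes(chunks), check_item_size(5096, chunks))
```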

ThomasWaldmann, Mar 30 '17 13:03

Comment by @enkore, moved from #2313:

The main problem with csize is not so much compatibility problems or something like that, but this issue:

  • Create an archive with some new chunks, they will have csize set
  • Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks
  • Delete the first archive
  • Sync the cache
  • Then the cache cannot know the csize of these chunks
  • So we need to ask the repository, which means it has to do per-chunk I/O, because the csize isn't in the repo index either (which makes a lot of sense)

But that might be ok, since it's a contrived example. A worst case in the very literal sense. Adding a repository API that returns only the length of a chunk (with the client falling back to len(GET()) if that API is not available) would avoid the network cost and mostly leave the I/O.
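A sketch of the fallback idea, assuming a hypothetical get_length method on newer repositories. The classes OldRepo and NewRepo and the method name are invented for illustration; they are not borg's repository API.

```python
class OldRepo:
    """Stand-in for a repository that only supports GET."""

    def __init__(self, objects):
        self.objects = objects

    def get(self, chunk_id):
        return self.objects[chunk_id]


class NewRepo(OldRepo):
    """Stand-in for a repository that can also report the stored length alone."""

    def get_length(self, chunk_id):  # hypothetical API
        return len(self.objects[chunk_id])


def get_csize(repo, chunk_id):
    get_length = getattr(repo, "get_length", None)
    if get_length is not None:
        # no object data is transferred, only the length
        return get_length(chunk_id)
    # fallback: fetch the whole object and measure it on the client
    return len(repo.get(chunk_id))


if __name__ == "__main__":
    objects = {b"id-a": b"x" * 1200}
    print(get_csize(OldRepo(objects), b"id-a"), get_csize(NewRepo(objects), b"id-a"))
```

Either way the repository still has to touch the segment entry, so the per-chunk I/O stays; only the transfer goes away.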

Compatibility issues are minor, e.g. an older Borg will show too small compressed/dedup sizes in borg info.

ThomasWaldmann, Mar 30 '17 15:03

solving the special problems for unchanged-skipped files

progress indication / correct size/csize in archived items:

  • we get their chunk IDs list from the files cache.

currently, we only have the chunk ids there, not their sizes/csizes.

we do not really have chunk content at hand here, so no information about size or csize - we get them via chunk_incref from chunks index for progress indication and also to generate the chunks list for the item (with correct size and csize).

  • in the adhoc chunks cache (see #2350), we do not have correct size or csize information!

  • we could modify files cache to include size (csize), then progress indication would work normally for unchanged files and also the archived items would have correct size/csize infos in their chunks list.

    memory usage: 32b str -> tuple(32b str, int, int). it is msgpacked, so maybe not that bad.

  • this would need to update chunks cache entries with unknown (c)size to the known (c)size from the files cache.
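A rough sketch of the last two points: a files cache that stores (id, size, csize) per chunk, and a lookup that fills in chunks cache entries whose (c)size is unknown. Using 0 as the "unknown" marker for both size and csize is an assumption of this sketch, and the function name is hypothetical.

```python
def chunks_for_unchanged_file(files_cache_chunks, chunks_cache):
    """Build the item's chunks list for an unchanged (skipped) file.

    files_cache_chunks: list of (id, size, csize) from the extended files cache
    chunks_cache:       dict id -> (refcount, size, csize), 0 meaning "unknown"
    """
    item_chunks = []
    for chunk_id, fc_size, fc_csize in files_cache_chunks:
        refcount, size, csize = chunks_cache.get(chunk_id, (0, 0, 0))
        # keep what the chunks cache already knows, else take it from the files cache
        size = size or fc_size
        csize = csize or fc_csize
        chunks_cache[chunk_id] = (refcount + 1, size, csize)
        item_chunks.append((chunk_id, size, csize))
    return item_chunks


if __name__ == "__main__":
    chunks_cache = {b"id-a": (1, 0, 0)}       # e.g. from an adhoc cache: sizes unknown
    files_cache_chunks = [(b"id-a", 4096, 1200)]
    print(chunks_for_unchanged_file(files_cache_chunks, chunks_cache))
    print(chunks_cache)
```

This does not help if the files cache itself is lost, since then there is nothing to take the sizes from.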

ThomasWaldmann, Mar 30 '17 16:03

Proposal

Problem

(1) Requiring csize means that things like AdhocChunksCache don't work, but we want that.

(2) Having csize in every reference to a chunk also means that after changing the chunk's compression, stats can be slightly off, depending on which version of the reference is seen first during cache sync.

(3) csize creates a dependency on the chunk processing before a chunk reference can be stored.

Solution

(1) Introduce csize=0 (compact encoding, one byte) for these.

After a cache sync, the client iterates over the chunks cache and will retrieve all objects with csize=0 from the repository and set the csize according to the length of the retrieved object. A dedicated API to avoid transferring the object data itself may be added for this. This fixes the scenario:

  • Create an archive with some new chunks, they will have csize set
  • Create another archive, but without the cache, so with csize=0 (which is unambiguous) for these chunks
  • Delete the first archive
  • Sync the cache
  • Then the cache cannot know the csize of these chunks
  • So we need to ask the repository, which means it has to do per-chunk I/O, because the csize isn't in the repo index either (which makes a lot of sense)
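The post-sync fix-up described in (1) could look roughly like the sketch below. The chunks cache is modelled as a plain dict id -> (refcount, size, csize) and the repository as a dict id -> stored bytes; both are stand-ins, and fix_unknown_csizes is a hypothetical name.

```python
def fix_unknown_csizes(chunks_cache, repo):
    """After a cache sync, resolve all csize=0 entries by asking the repository.

    Returns the number of entries that were fixed.
    """
    fixed = 0
    for chunk_id, (refcount, size, csize) in list(chunks_cache.items()):
        if csize == 0:
            # per-chunk I/O: the stored length is not in the repo index
            csize = len(repo[chunk_id])
            chunks_cache[chunk_id] = (refcount, size, csize)
            fixed += 1
    return fixed


if __name__ == "__main__":
    repo = {b"id-a": b"x" * 1200, b"id-b": b"y" * 300}
    chunks_cache = {b"id-a": (1, 4096, 0), b"id-b": (2, 1000, 300)}
    print(fix_unknown_csizes(chunks_cache, repo), chunks_cache)
```

With a dedicated length API, the len(repo[chunk_id]) call would be replaced by a request that does not transfer the object data.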

(2) Disregard slightly off stats due to recompression, and incomplete stats (at archive creation time) with AdhocChunksCache.

(3) The dependency of chunk references on csize and therefore chunk processing for a new chunk is actually an advantage, since it ensures that the system is always in a forward-consistent state: With the dependency on csize, the cache can only emit references to chunks that are already stored. Therefore, receiving a chunk reference from the cache implies that it will be contained within a repository commit initiated after receiving the reference. This simplifies reasoning about the system considerably, especially in a concurrent setting.

enkore, Apr 06 '17 10:04

@enkore If you want to be able to implement the AdhocChunkscache later, I guess your proposal needs to address size also, not just csize, see my comments above. Even with my idea, there is a problem if the files cache is lost.

ThomasWaldmann, Apr 06 '17 12:04

A first solution could be to just not use a files cache in this case.

This may be a bit annoying with 1.1, though a quick back-of-the-envelope calculation suggests that in many cases it would still be worth it. E.g. I know that with --no-files-cache a system backup needs about 15 minutes, but a cache sync to one of my larger repos takes longer than that. So --no-cache-sync (implying --no-files-cache) would still be faster. And with 1.2 this becomes even less of an issue.

enkore, Apr 06 '17 12:04

#2654 implements my proposal above to the letter.

enkore, Jun 10 '17 16:06

#6763 removes csize everywhere.

something related is still in the entry length of a segment file entry and might come to the repo index via #6705.

ThomasWaldmann, Jun 12 '22 15:06

This is now in borg2 branch, so the csize related issues are solved by removing it everywhere.

Update: merged into master now.

ThomasWaldmann, Jul 04 '22 16:07

Note: size, csize, ctype and clevel are now available as separate encrypted metadata via repo.get(id, read_data=False) and RepoObj.parse_meta(chunk).

ThomasWaldmann, Sep 09 '22 19:09

IIRC / AFAIK, there is nothing left we need to do right now, but we could do some improvements later:

  • improve the adhoc chunks cache
  • improve stats / info

ThomasWaldmann, Jan 22 '23 13:01