borg get rid of refcounts?

borg currently is usually precisely refcounting objects. this ticket is split off from #7377 to focus on refcounting here only:

precise refcounting 0..N (like currently done in borg)
is object_still_needed = bool(refcount > 0) all we really need?

Feb 24 '23 19:02 ThomasWaldmann

about refcounts

They are a pain to maintain:

if something crashes
if we have concurrent operations in the repo, esp. if multiple borg processes would concurrently (or even just in an alternating order) update the repo. client1 does not know what client2 is doing and vice versa. so clientside refcounts are easily outdated for multi-client scenarios and need expensive fixing (iterating over all archives, counting all references).

What are refcounts needed for?

detecting refcount == 0 is needed to decide about whether an object should be deleted (equivalent: id in unreferenced_set)
when borg create adds a new chunk (this is when refcount changes 0 -> 1, but we can also know it because the chunk does not exist and we add it), it is considered unique for this archive and contributes to the deduplicated size of it (this is interesting when borg create just finished and before creating the next archive, but as soon as the next archive references the same chunk again, it is not in the deduplicated size any more, because now 2 archives use this chunk - so this is a bit of a problematic statistics value)

Code using refcounts

ChunkIndexEntry has (refcount, size)
cleanup when we have written file content chunks, but did not reach the code writing the file item that references these chunks: the exception handler decrefs all the chunks we have already written/incref'd.
tests and debug commands
sum_unique_chunks_metadata
mark_as_possibly_superseded checks for refcount 0
orphan_chunks_check checks for refcount 0

LocalCache

cache.seen_chunk returns refcount
cache.add_chunk decides with it whether to process/write a chunk or just increase refcount
cache.chunk_decref (when refcount reaches 0) deletes the chunk from the repo and also the entry in the chunks index and adjusts size stats
cache.chunk_incref used to increment the refcount of an already existing chunk

AdhocCache

Guess this cache already goes half the way we need to go:

AdhocCache works differently with refcount
- sets it to MAX_VALUE for existing chunks in the repo (== unknown refcount, assume infinite, never delete)
- tracks precisely for new chunks added within the current session (for this case, we can know precisely)
It does not implement a files cache yet (which is crucial for borg's high speed with unchanged files):
- "would require persistence" (sure, but rather use some local space than have slow backups with lots of I/O)
- another reason is likely that the files cache we have in LocalCache only has IDs in the chunks lists, but not chunk sizes (which we also need to create an Item we write to an archive). Also, the AdhocCache chunks index does not know about chunk plaintext sizes either for already existing chunks (they have size==0).
- so, we need to get the chunk sizes from somewhere: we either need to add them to the files cache or persist the chunks index.

How to improve?

All we need for basic borg create is the set of chunkids of the chunks existing in the repo.

For remote as well as for local repos, we do not want to ask the repo about the existence of each chunk, but rather have a quick (client side) hashtable lookup to decide that.
if it exists, we just reference it and are done.
if not, we need to compress/encrypt/auth/write it to the repo (and add it to the set of chunkids in the repo).
borg create could just internally sum up newly written chunks (no need to refcount for that).
accept some orphan chunks (if we don't precisely refcount):
- if we run into an Backup(OS)Error exception before writing a file item to the archive, the already written content chunks will be orphans (if we can't clean them up in exception handler due to lack of precise refcounting). this kind of exception is not fatal, thus borg would still continue with remaining files (and commit the transaction at the end).
- if we would also go away from transaction rollback, we might get many orphans if some backup is aborted.

borg delete/prune would just remove the archive entry from manifest (== the root object of this archive, which is for sure not needed for anything else / not referenced by anything else), creating many orphans and leave all else to borg gc.

borg gc / borg check would kill everything not referenced / all the orphans:

It does not even need real refcounts, but could start with a set of existing object IDs and after encountering a reference add it to a set of referenced object IDs. orphans = existing - referenced.
Or start with everything in orphans set and move IDs to referenced set when a reference is encountered. At the end, we can delete everything that remains in orphans set.

Feb 24 '23 19:02 ThomasWaldmann

Updates:

implemented AdhocCacheWithFiles a while ago
currently working on an experimental branch with borg compact implementing a garbage collector

Aug 13 '24 15:08 ThomasWaldmann