get rid of refcounts?
borg currently refcounts objects precisely in most cases. This ticket is split off from #7377 to focus on refcounting only:
- precise refcounting 0..N (like currently done in borg)
- is `object_still_needed = bool(refcount > 0)` all we really need?
about refcounts
They are a pain to maintain:
- if something crashes
- if we have concurrent operations in the repo, esp. if multiple borg processes update the repo concurrently (or even just alternately). client1 does not know what client2 is doing and vice versa, so client-side refcounts easily become outdated in multi-client scenarios and need expensive fixing (iterating over all archives, counting all references).
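To illustrate why that fixing is expensive, here is a minimal sketch (not borg's actual code). `archives` is a hypothetical mapping of archive name to the chunk IDs it references, and `rebuild_refcounts` is an illustrative name. The point is that a recount has to touch every reference of every archive, no matter how little changed.

```python
from collections import Counter


def rebuild_refcounts(archives):
    """Recount all references from scratch.

    `archives` is a hypothetical mapping: archive name -> list of chunk ids.
    The cost is proportional to the total number of references in the repo.
    """
    refcounts = Counter()
    for chunk_ids in archives.values():
        refcounts.update(chunk_ids)
    return refcounts


archives = {
    "client1-monday": ["a", "b", "c"],
    "client2-monday": ["b", "c", "d"],
}
# client1's local view, taken before client2 wrote its archive, is outdated:
stale = rebuild_refcounts({"client1-monday": archives["client1-monday"]})
# only a full pass over all archives gives the correct counts again:
fresh = rebuild_refcounts(archives)
```

Here `stale` still counts one reference for chunk `b`, while `fresh` counts two.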
What are refcounts needed for?
- detecting `refcount == 0` is needed to decide whether an object should be deleted (equivalent: `id in unreferenced_set`)
- when borg create adds a new chunk (this is when the refcount changes `0 -> 1`, but we can also know it because the chunk does not exist and we add it), it is considered unique for this archive and contributes to its deduplicated size. This is interesting right after borg create has finished and before the next archive is created, but as soon as the next archive references the same chunk again, it is no longer part of the deduplicated size, because now 2 archives use this chunk. So this is a somewhat problematic statistics value.
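A small sketch of why the deduplicated size is an unstable number. This is a toy model with made-up chunk IDs and sizes, and `deduplicated_size` is an illustrative helper, not a borg function.

```python
def deduplicated_size(archive, archives, sizes):
    """Sum the sizes of chunks referenced by `archive` and by no other archive."""
    others = set()
    for name, chunk_ids in archives.items():
        if name != archive:
            others.update(chunk_ids)
    return sum(sizes[cid] for cid in set(archives[archive]) - others)


sizes = {"a": 100, "b": 200, "c": 300}
archives = {"monday": ["a", "b"]}
# right after creating "monday", chunks a and b are unique to it
before = deduplicated_size("monday", archives, sizes)

archives["tuesday"] = ["b", "c"]  # the next archive references b again
# now only chunk a is unique to "monday", although "monday" did not change
after = deduplicated_size("monday", archives, sizes)
```

The value for the same archive drops from 300 to 100 just because another archive was created.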
Code using refcounts
- `ChunkIndexEntry` has `(refcount, size)`
- cleanup when we have written file content chunks, but did not reach the code writing the file item that references these chunks: the exception handler decrefs all the chunks we have already written/incref'd.
- tests and debug commands
- `sum_unique_chunks_metadata`
- `mark_as_possibly_superseded` checks for refcount 0
- `orphan_chunks_check` checks for refcount 0
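The exception handler cleanup mentioned above could look roughly like this. It is a simplified sketch, not the real borg code: `BackupOSError` here is a local stand-in class, `process_file` and `fail_at` are hypothetical, and the refcounts live in a plain dict.

```python
class BackupOSError(Exception):
    """Stand-in for a non-fatal, per-file backup error."""


def process_file(chunk_ids, refcounts, fail_at=None):
    """Incref the chunks of one file; undo the increfs if we fail midway.

    `refcounts` is a plain dict (chunk id -> refcount). `fail_at` simulates
    an error right before chunk number `fail_at` would be processed.
    """
    done = []  # chunks we have already written/incref'd for this file
    try:
        for n, chunk_id in enumerate(chunk_ids):
            if n == fail_at:
                raise BackupOSError("simulated read error")
            refcounts[chunk_id] = refcounts.get(chunk_id, 0) + 1
            done.append(chunk_id)
        return done  # here the file item referencing `done` would be written
    except BackupOSError:
        for chunk_id in done:
            refcounts[chunk_id] -= 1
            if refcounts[chunk_id] == 0:
                del refcounts[chunk_id]  # would also delete the chunk from the repo
        raise


refcounts = {"a": 1}  # "a" is already referenced by an older archive
try:
    process_file(["a", "b", "c"], refcounts, fail_at=2)
except BackupOSError:
    pass  # not fatal: the backup would continue with the next file
```

After the failed file, `refcounts` is back to `{"a": 1}`. This only works because the counts are precise; without them the handler cannot tell whether `b` may be removed.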
LocalCache
- `cache.seen_chunk` returns refcount
- `cache.add_chunk` decides with it whether to process/write a chunk or just increase refcount
- `cache.chunk_decref` (when refcount reaches 0) deletes the chunk from the repo and also the entry in the chunks index and adjusts size stats
- `cache.chunk_incref` used to increment the refcount of an already existing chunk
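As a rough illustration of how these four operations interact, here is a toy model. It only mirrors the method names from the list above; the class `ToyChunkCache`, its attributes, and the signatures are assumptions and much simpler than the real `LocalCache`.

```python
class ToyChunkCache:
    """Toy model of refcount-driven cache operations (not borg's LocalCache)."""

    def __init__(self):
        self.repo = {}        # chunk id -> stored data (stands in for the repository)
        self.chunks = {}      # chunk id -> [refcount, size] (stands in for the chunks index)
        self.unique_size = 0  # total size of all chunks currently stored

    def seen_chunk(self, chunk_id):
        """Return the refcount, 0 if the chunk is unknown."""
        entry = self.chunks.get(chunk_id)
        return entry[0] if entry else 0

    def chunk_incref(self, chunk_id):
        self.chunks[chunk_id][0] += 1

    def add_chunk(self, chunk_id, data):
        if self.seen_chunk(chunk_id):
            self.chunk_incref(chunk_id)  # already stored: just add a reference
            return
        self.repo[chunk_id] = data  # new chunk: write it
        self.chunks[chunk_id] = [1, len(data)]
        self.unique_size += len(data)

    def chunk_decref(self, chunk_id):
        entry = self.chunks[chunk_id]
        entry[0] -= 1
        if entry[0] == 0:  # last reference gone: delete chunk, index entry, stats
            del self.chunks[chunk_id]
            del self.repo[chunk_id]
            self.unique_size -= entry[1]


cache = ToyChunkCache()
cache.add_chunk("a", b"data")  # written, refcount 1
cache.add_chunk("a", b"data")  # not written again, refcount 2
cache.chunk_decref("a")
still_there = "a" in cache.repo  # one reference left, chunk is kept
cache.chunk_decref("a")  # refcount reaches 0, chunk is deleted
```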
AdhocCache
Guess this cache already goes half the way we need to go:
- `AdhocCache` works differently with refcount:
- sets it to `MAX_VALUE` for existing chunks in the repo (== unknown refcount, assume infinite, never delete)
- tracks precisely for new chunks added within the current session (for this case, we can know precisely)
- It does not implement a files cache yet (which is crucial for borg's high speed with unchanged files):
- "would require persistence" (sure, but rather use some local space than have slow backups with lots of I/O)
- another reason is likely that the files cache we have in LocalCache only has IDs in the chunks lists, but not chunk sizes (which we also need to create an Item we write to an archive). Also, the AdhocCache chunks index does not know about chunk plaintext sizes either for already existing chunks (they have size==0).
- so, we need to get the chunk sizes from somewhere: we either need to add them to the files cache or persist the chunks index.
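A minimal sketch of the "unknown refcount" idea described above. The `MAX_VALUE` number used here is an arbitrary placeholder (not borg's actual constant), and `ToyAdhocIndex` is a hypothetical class, not the real `AdhocCache`.

```python
MAX_VALUE = 2**32 - 1  # placeholder sentinel: "refcount unknown, never delete"


class ToyAdhocIndex:
    """Toy chunks index: pre-existing chunks get MAX_VALUE, new ones are counted."""

    def __init__(self, existing_ids):
        # chunks already in the repo: unknown refcount and unknown size (0)
        self.chunks = {cid: [MAX_VALUE, 0] for cid in existing_ids}

    def add_chunk(self, chunk_id, size):
        entry = self.chunks.get(chunk_id)
        if entry is None:
            self.chunks[chunk_id] = [1, size]  # new in this session: precise
        elif entry[0] != MAX_VALUE:
            entry[0] += 1  # only session-local chunks are counted

    def chunk_decref(self, chunk_id):
        """Return True if the chunk may be deleted from the repo."""
        entry = self.chunks[chunk_id]
        if entry[0] == MAX_VALUE:
            return False  # unknown refcount: assume infinite, never delete
        entry[0] -= 1
        if entry[0] == 0:
            del self.chunks[chunk_id]
            return True
        return False


index = ToyAdhocIndex(existing_ids={"old"})
index.add_chunk("new", 10)
old_deletable = index.chunk_decref("old")  # pre-existing chunk: must be kept
new_deletable = index.chunk_decref("new")  # added in this session: can be removed
```

Note that the entry for `old` has size 0, which is exactly the missing chunk size problem mentioned above for a files cache.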
How to improve?
All we need for basic borg create is the set of chunkids of the chunks existing in the repo.
- For remote as well as for local repos, we do not want to ask the repo about the existence of each chunk, but rather have a quick (client side) hashtable lookup to decide that.
- if it exists, we just reference it and are done.
- if not, we need to compress/encrypt/auth/write it to the repo (and add it to the set of chunkids in the repo).
- borg create could just internally sum up newly written chunks (no need to refcount for that).
- accept some orphan chunks (if we don't precisely refcount):
- if we run into a Backup(OS)Error exception before writing a file item to the archive, the already written content chunks will be orphans (if we can't clean them up in the exception handler due to lack of precise refcounting). this kind of exception is not fatal, thus borg would still continue with the remaining files (and commit the transaction at the end).
- if we also moved away from transaction rollback, we might get many orphans when a backup is aborted.
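The set-based borg create described in the list above could be sketched like this. Everything here is illustrative: `toy_create` is a hypothetical function, the repo is a plain dict, and plain SHA-256 is only a stand-in for however chunk IDs are really computed.

```python
import hashlib


def toy_create(file_chunks, repo, chunk_ids):
    """Store the chunks of one backup run without any refcounting.

    file_chunks: iterable of bytes objects
    repo: dict (chunk id -> data), stands in for the repository
    chunk_ids: client-side set of the IDs of all chunks existing in the repo
    Returns (list of (chunk id, size) references, size of newly written chunks).
    """
    new_size = 0
    references = []
    for data in file_chunks:
        chunk_id = hashlib.sha256(data).hexdigest()
        if chunk_id not in chunk_ids:  # quick local lookup, no repo query
            repo[chunk_id] = data  # real code would compress/encrypt/auth first
            chunk_ids.add(chunk_id)
            new_size += len(data)  # sum up new chunks, no refcount needed
        references.append((chunk_id, len(data)))
    return references, new_size


repo, chunk_ids = {}, set()
_, first = toy_create([b"aaaa", b"bb"], repo, chunk_ids)
_, second = toy_create([b"aaaa", b"cccccc"], repo, chunk_ids)
```

The first run writes 6 bytes of new chunks. The second run also writes 6 bytes, because `b"aaaa"` is already in the set and is only referenced.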
borg delete/prune would just remove the archive entry from the manifest (== the root object of this archive, which is for sure not needed for anything else / not referenced by anything else), creating many orphans, and leave everything else to borg gc.
borg gc / borg check would kill everything not referenced / all the orphans:
- It does not even need real refcounts, but could start with a set of existing object IDs and, after encountering a reference, add it to a set of referenced object IDs. `orphans = existing - referenced`.
- Or start with everything in the orphans set and move IDs to the referenced set when a reference is encountered. At the end, we can delete everything that remains in the orphans set.
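Both variants are easy to sketch with plain Python sets. The function names and the `archives` mapping (archive name to referenced object IDs) are hypothetical, and this ignores how references would actually be discovered in a real repo.

```python
def find_orphans(existing, archives):
    """Variant 1: collect all referenced IDs, then subtract them."""
    referenced = set()
    for object_ids in archives.values():
        referenced.update(object_ids)
    return existing - referenced


def find_orphans_by_elimination(existing, archives):
    """Variant 2: start with everything as orphan, remove IDs as references appear."""
    orphans = set(existing)  # copy, so the caller's set is not modified
    for object_ids in archives.values():
        orphans.difference_update(object_ids)
    return orphans


existing = {"a", "b", "c", "d"}
archives = {"monday": ["a", "b"], "tuesday": ["b", "c"]}
orphans = find_orphans(existing, archives)
```

Both functions return `{"d"}` here. Neither needs to know how often `b` is referenced, only that it is referenced at all.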
Updates:
- implemented `AdhocCacheWithFiles` a while ago
- currently working on an experimental branch with `borg compact` implementing a garbage collector