datakit
Implement a GC
There is no GC in Irmin (we just use git gc for now). This needs to be fixed if we want to version control everything. Moreover, we need to understand which hooks need to be exposed by the storage backend so we can register the hooks that we need to have a high-performance GC.
See mirage/irmin#71 and http://lists.xenproject.org/archives/html/mirageos-devel/2015-10/msg00040.html
@kayceesrk you expressed interest in this. You are still very welcome to have a look at it :-)
Yes. I was looking at this a few weeks ago, and I am still interested in doing this. I did have a few questions regarding this:
- How/when do objects become unreachable in a typical git workflow?
- What is the root set for the GC? Given that the data structures are persistent (I am imagining merge-queues for simplicity, and because I understand how they work), there is always a way to reach objects in the past by checking out a previous commit.
- In the case of persistent data structures, references are embedded in the object in an AO store. How does the GC distinguish between values and references?
It would help to have an example of a scenario where objects become unreachable.
I think there are various levels of complexity for that task.
1. GC is done offline, blobs do not contain pointers to other objects, and the roots are the Git references. Basically, it amounts to running git gc on process start (pretty easy) or re-implementing something similar in Irmin (a bit more involved, especially if we are interested in the pack compression, but doable).
2. GC is done online, blobs do not contain pointers to other objects, and the roots are the Git references and temporary branches. Very similar to what is described above; we just need to be careful with locking, and also need to register temporary roots for the temporary (anonymous) branches.
3. GC is done online, blobs can contain pointers to other objects, and the roots are the Git references and temporary branches. This needs a change in the Irmin data model, so probably API breakage, and I suspect a bigger impact.
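To make the first two levels concrete, here is a minimal mark-and-sweep sketch over an in-memory object store. It is only a model, not Irmin's API: the `store` and `refs` dicts, the `kind`/`parents`/`tree`/`entries` field names, and the `temp_roots` parameter are all hypothetical names chosen for illustration. It assumes blobs contain no pointers, so only commits and trees are traversed.

```python
def mark(store, roots):
    """Return the set of keys reachable from the given root commit keys."""
    reachable = set()
    stack = list(roots)
    while stack:
        key = stack.pop()
        if key in reachable or key not in store:
            continue
        reachable.add(key)
        obj = store[key]
        if obj["kind"] == "commit":
            # A commit points to its parent commits and to one tree.
            stack.extend(obj["parents"])
            stack.append(obj["tree"])
        elif obj["kind"] == "tree":
            # A tree maps names to blob or sub-tree keys.
            stack.extend(obj["entries"].values())
        # Blobs are opaque leaves: assumed to hold no pointers.
    return reachable


def sweep(store, reachable):
    """Delete every unmarked object and return the sorted deleted keys."""
    dead = [key for key in store if key not in reachable]
    for key in dead:
        del store[key]
    return sorted(dead)


def gc(store, refs, temp_roots=()):
    """Roots are the named references plus any registered temporary roots."""
    roots = list(refs.values()) + list(temp_roots)
    return sweep(store, mark(store, roots))
```

In this model, the online variant differs only in what is passed as `temp_roots` (the heads of anonymous branches) and in the locking around `sweep`, which the sketch leaves out.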
We need to ship 1. pretty soon, but I'd be interested in having a PoC for 2. as well. 3. needs a design discussion about API changes for Blobs.
Objects usually become unreachable when people rebase or delete a branch.
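As a small illustration of that scenario, the sketch below models only the commit graph and shows which commits stop being reachable from the references after a rebase and after a branch deletion. The commit names, the `parents` mapping, and both helper functions are made up for this example; it also ignores Git's reflog, which in a real repository keeps such commits alive for a while.

```python
def reachable_commits(parents, refs):
    """Walk parent links from every reference head."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def unreachable_commits(parents, refs):
    """Commits present in the store that no reference can reach."""
    return sorted(set(parents) - reachable_commits(parents, refs))


# History: base <- m1 (master) and base <- f1 <- f2 (feature).
parents = {"base": [], "m1": ["base"], "f1": ["base"], "f2": ["f1"]}
refs = {"master": "m1", "feature": "f2"}
before = unreachable_commits(parents, refs)

# Rebasing feature onto master rewrites f1, f2 as new commits f1r, f2r.
parents.update({"f1r": ["m1"], "f2r": ["f1r"]})
refs["feature"] = "f2r"
after_rebase = unreachable_commits(parents, refs)

# Deleting the branch drops the only reference to the rewritten commits too.
del refs["feature"]
after_delete = unreachable_commits(parents, refs)

print(before)        # []
print(after_rebase)  # ['f1', 'f2']
print(after_delete)  # ['f1', 'f1r', 'f2', 'f2r']
```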