
Implement a GC

Open samoht opened this issue 9 years ago • 4 comments

There is no GC in Irmin (we just use git gc for now). This needs to be fixed if we want to version control everything. Moreover, we need to understand which hooks the storage backend must expose so that a high-performance GC can register the callbacks it needs.

See mirage/irmin#71 and http://lists.xenproject.org/archives/html/mirageos-devel/2015-10/msg00040.html

samoht avatar Mar 01 '16 19:03 samoht

@kayceesrk you expressed interest in this. You are still very welcome to have a look at it :-)

samoht avatar May 23 '16 18:05 samoht

Yes. I was looking at this a few weeks ago, and I am still interested in doing this. I did have a few questions regarding this:

  • How/when do objects become unreachable in a typical git workflow?
  • What is the root set for the GC? Given that the data structures are persistent (I am imagining merge-queues for simplicity, and because I understand how they work), there is always a way to reach objects in the past by checking out a previous commit.
  • In the case of persistent data structures, references are embedded in the object in an AO store. How does the GC distinguish between values and references?

It would help to have an example of a scenario where objects become unreachable.

kayceesrk avatar May 23 '16 19:05 kayceesrk

I think there are various levels of complexity for that task.

  1. GC is done off-line, blobs do not contain pointers to other objects, and the roots are the Git references. Basically, it amounts to running git gc on process start (pretty easy) or re-implementing something similar in Irmin (a bit more involved, especially if we are interested in the pack compression, but doable).
  2. GC is done online, blobs do not contain pointers to other objects, and the roots are the Git references and temporary branches. Very similar to what is described above; we just need to be careful with locking, and we also need to register temporary roots for the temporary (anonymous) branches.
  3. GC is done online, blobs can contain pointers to other objects, and the roots are the Git references and temporary branches. This needs a change in the Irmin data model, so probably API breakage, and I suspect a bigger impact.

We need to ship 1. pretty soon, but I'd be interested in having a PoC for 2. as well. 3. needs a design discussion about API changes for blobs.
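
To make level 1 concrete, here is a minimal mark-and-sweep sketch. It is only an illustration under stated assumptions: the `obj` type, the string hashes, the `Hashtbl`-backed store and the `gc` function are all hypothetical and are not Irmin's actual types or API. It relies on the level 1 and 2 assumption that blobs contain no pointers, so only trees and commits have outgoing edges.

```ocaml
(* Hypothetical, simplified object model; NOT Irmin's real types. *)
module SS = Set.Make (String)

type obj =
  | Blob of string                  (* opaque contents, no pointers *)
  | Tree of (string * string) list  (* (entry name, hash of child) *)
  | Commit of { tree : string; parents : string list }

(* Outgoing edges of an object; blobs have none by assumption. *)
let children = function
  | Blob _ -> []
  | Tree entries -> List.map snd entries
  | Commit { tree; parents } -> tree :: parents

(* Mark phase: collect every hash reachable from the roots.
   Plain recursion, so a very deep history would need an explicit stack. *)
let mark (store : (string, obj) Hashtbl.t) (roots : string list) : SS.t =
  let rec visit seen h =
    if SS.mem h seen then seen
    else
      let seen = SS.add h seen in
      match Hashtbl.find_opt store h with
      | None -> seen (* dangling hash: nothing to traverse *)
      | Some o -> List.fold_left visit seen (children o)
  in
  List.fold_left visit SS.empty roots

(* Sweep phase: remove every unmarked object; return the removed hashes. *)
let sweep store live =
  let dead =
    Hashtbl.fold
      (fun h _ acc -> if SS.mem h live then acc else h :: acc)
      store []
  in
  List.iter (Hashtbl.remove store) dead;
  List.sort compare dead

(* Roots are the heads of the references, given as (name, commit hash). *)
let gc store ~refs = sweep store (mark store (List.map snd refs))

(* A tiny store: commit c2 (and its tree t2 and blob b2) sits on top of c1. *)
let example_store () =
  let store = Hashtbl.create 16 in
  List.iter
    (fun (h, o) -> Hashtbl.replace store h o)
    [ ("b1", Blob "hello");
      ("b2", Blob "orphan");
      ("t1", Tree [ ("README", "b1") ]);
      ("t2", Tree [ ("tmp", "b2") ]);
      ("c1", Commit { tree = "t1"; parents = [] });
      ("c2", Commit { tree = "t2"; parents = [ "c1" ] }) ];
  store
```

With a single reference pointing at `c1`, `gc (example_store ()) ~refs:[ ("master", "c1") ]` removes `b2`, `c2` and `t2`. For level 2, the heads of the temporary branches would presumably be appended to `refs` before marking; the locking is not shown here. Level 3 would require `children` to return something for blobs as well, which is exactly the data model change mentioned above.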

samoht avatar May 23 '16 19:05 samoht

Objects usually become unreachable when people rebase or delete a branch.
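
A small worked scenario may help here. The sketch below uses a made-up commit graph (the hashes `a` to `e`, `c2`, `d2` and the `parents` function are purely illustrative) and computes which commits stop being reachable when a branch is rebased and then deleted.

```ocaml
module SS = Set.Make (String)

(* Hypothetical history, commit hash -> parent hashes:
     a <- b <- c <- d        original "feature" branch
          b <- e             "master" moved on to e
               e <- c2 <- d2 "feature" after rebasing onto master *)
let parents = function
  | "b" -> [ "a" ]
  | "c" -> [ "b" ]
  | "d" -> [ "c" ]
  | "e" -> [ "b" ]
  | "c2" -> [ "e" ]
  | "d2" -> [ "c2" ]
  | _ -> [] (* "a" is the initial commit *)

(* All commits reachable from hash [h], accumulated into [seen]. *)
let rec reachable seen h =
  if SS.mem h seen then seen
  else List.fold_left reachable (SS.add h seen) (parents h)

(* Live set for a list of references given as (name, head hash). *)
let live refs = List.fold_left reachable SS.empty (List.map snd refs)

(* Commits that were live with [before] but are not live with [after]. *)
let garbage ~before ~after =
  SS.elements (SS.diff (live before) (live after))

let before_rebase = [ ("master", "e"); ("feature", "d") ]
let after_rebase = [ ("master", "e"); ("feature", "d2") ]
let after_delete = [ ("master", "e") ]
```

Here `garbage ~before:before_rebase ~after:after_rebase` is `["c"; "d"]` (the pre-rebase commits), and `garbage ~before:after_rebase ~after:after_delete` is `["c2"; "d2"]`. In plain Git, the reflog can keep such commits alive for a grace period, so they are not necessarily collected right away.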

samoht avatar May 23 '16 20:05 samoht