Support compaction &| deletion.

Open thedodd opened this issue 5 years ago • 3 comments

It looks like there is already an issue open for compression. Awesome. In this case, I'm looking for a way to remove old entries, perhaps everything from a given offset back. I see the truncation method, which is definitely useful in certain cases, especially when dealing with Raft and such.

For compaction/deletion, the use case is a log that is only intended to be kept around for a specific amount of time, or whose entries are to be deleted after a specific amount of time. E.g., keep messages around for 1 week, then remove them.

I'm happy to implement this, as I am strongly considering using this for a project of mine. I just wanted to open a ticket for some general discussion. Thoughts?

thedodd avatar May 30 '19 21:05 thedodd

Thanks for filing this.

Some sort of compaction/rewrite functionality would be really useful. I think there are a couple cases here:

  1. Time-based retention (e.g. Kafka does this by keeping a time index in addition to an offset -> address index)
  2. Rewrite the log by removing certain entries (in Kafka parlance, this is a compacted topic)

When I was thinking about this a few months ago, I envisioned a generalized indexing scheme for the time support: you could have some sort of custom index for a field like "timestamp", and we wouldn't have to introduce time concepts throughout the code base.
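To make the custom-index idea concrete, here is a minimal sketch of a timestamp index kept next to the log. `TimeIndex`, `record`, and `first_offset_at_or_after` are hypothetical names, not part of the crate, and the sketch assumes timestamps never decrease as offsets grow.

```rust
use std::collections::BTreeMap;

/// Hypothetical secondary index (not part of the crate): maps a timestamp in
/// milliseconds to the first offset appended at that timestamp.
/// Assumption: timestamps are non-decreasing as offsets increase.
#[derive(Default)]
pub struct TimeIndex {
    entries: BTreeMap<u64, u64>,
}

impl TimeIndex {
    /// Record that `offset` was appended at `timestamp_ms`.
    /// Only the first offset seen for a given timestamp is kept.
    pub fn record(&mut self, timestamp_ms: u64, offset: u64) {
        self.entries.entry(timestamp_ms).or_insert(offset);
    }

    /// Return the first indexed offset whose timestamp is >= `cutoff_ms`,
    /// i.e. the lowest offset that a retention pass must keep.
    /// Returns `None` when every indexed entry is older than the cutoff.
    pub fn first_offset_at_or_after(&self, cutoff_ms: u64) -> Option<u64> {
        self.entries
            .range(cutoff_ms..)
            .next()
            .map(|(_, &offset)| offset)
    }
}
```

A retention pass could compute `cutoff_ms` as "now minus one week" and hand the resulting offset to whatever deletion routine the log exposes, so the log itself never needs to know about time.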

For key-based compaction, one could use a key index along with custom code to do the actual comparison for compaction.
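As a rough illustration of key-based compaction, the sketch below keeps only the newest record per key. The `Record` struct and `compact_by_key` function are hypothetical; the real log stores opaque payloads, so extracting a key would be up to the custom code mentioned above.

```rust
use std::collections::HashMap;

/// Hypothetical record shape, used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub offset: u64,
    pub key: String,
    pub value: Vec<u8>,
}

/// Keep only the newest record for each key, preserving the input order.
pub fn compact_by_key(records: &[Record]) -> Vec<Record> {
    // Pass 1: build a key index mapping each key to its highest offset.
    let mut latest: HashMap<&str, u64> = HashMap::new();
    for r in records {
        let entry = latest.entry(r.key.as_str()).or_insert(r.offset);
        if r.offset > *entry {
            *entry = r.offset;
        }
    }

    // Pass 2: retain only the records that the key index points at.
    records
        .iter()
        .filter(|r| latest.get(r.key.as_str()) == Some(&r.offset))
        .cloned()
        .collect()
}
```

Two passes keep the logic simple. A real implementation would more likely stream over segments and write the survivors into a new segment instead of holding everything in memory.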

The other thing worth thinking through is the complexity of removing the segments themselves vs. the complexity of doing a full rewrite of a segment with part of the log truncated. Both pieces of functionality would be interesting, but the trade-off is worth considering.

Which requirements do you need sooner rather than later?

zowens avatar Jun 01 '19 22:06 zowens

Setting aside the mapping from time-based indices to offsets, a first step would be to add the ability to do approximate and exact deletion.

Approximate deletion would quickly delete all segments that come before the segment containing the given lower_bound offset.

Exact deletion would actually create a new segment that truncates the segment containing the offset, so that it does not contain any extra records.
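The two variants could look roughly like the following sketch. It uses an in-memory `Segment` type invented for illustration; the crate's real segments are files on disk, and none of these names come from its API. It assumes segments are non-empty and sorted by base offset, and it reads "extra records" as records below lower_bound.

```rust
/// Hypothetical in-memory segment: a contiguous run of records whose first
/// record has offset `base_offset`. Assumed non-empty for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub base_offset: u64,
    pub records: Vec<Vec<u8>>,
}

impl Segment {
    /// Offset one past the last record in this segment.
    pub fn next_offset(&self) -> u64 {
        self.base_offset + self.records.len() as u64
    }
}

/// Approximate deletion: drop every segment that lies entirely below
/// `lower_bound`. The segment containing `lower_bound` is kept whole, so some
/// records below the bound may survive.
pub fn delete_approximate(segments: &mut Vec<Segment>, lower_bound: u64) {
    segments.retain(|s| s.next_offset() > lower_bound);
}

/// Exact deletion: drop whole segments as above, then rewrite the boundary
/// segment so that its first record is exactly `lower_bound`.
/// Assumption: `segments` is sorted by `base_offset`.
pub fn delete_exact(segments: &mut Vec<Segment>, lower_bound: u64) {
    delete_approximate(segments, lower_bound);
    if let Some(first) = segments.first_mut() {
        if first.base_offset < lower_bound {
            // After the approximate pass this segment ends above the bound,
            // so `skip` is strictly less than the number of records.
            let skip = (lower_bound - first.base_offset) as usize;
            first.records.drain(..skip);
            first.base_offset = lower_bound;
        }
    }
}
```

The approximate variant only unlinks whole segments, which is why it can be fast. The exact variant pays for one partial rewrite of the boundary segment.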

These two methods would likely exist regardless of additional features like time-based indices. I have no idea how Kafka implements its retention, but these approaches seemed obvious to me when I was imagining how to implement it. It's an important feature for me, so I will likely add it in a fork. If these seem like reasonable methods, I can put them into a PR.

norcalli avatar Feb 28 '21 03:02 norcalli

@norcalli Agreed, those seem like reasonable approaches. Feel free to PR it!

zowens avatar Mar 06 '21 18:03 zowens