project-m36
create a manual garbage collection utility
Project:M36 currently uses a very simple Binary dump to store each transaction. However, because every commit serializes the complete database state, a database that grows with each commit sees O(n^2) growth in overall on-disk size. As a short-term countermeasure, we could implement a garbage collection utility which would take an existing database and trim it to only its latest transactions, deleting all history before that.
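To make the growth concrete, here is a toy Haskell illustration (not Project:M36 code): if the database gains roughly one unit of data per commit and every commit re-serializes the whole state, the cumulative bytes written grow quadratically.

```haskell
-- Toy model: commit i re-serializes a database holding i units of data,
-- so total storage after n commits is 1 + 2 + ... + n = n(n+1)/2.
totalWritten :: Int -> Int
totalWritten n = sum [1 .. n]

main :: IO ()
main = do
  print (totalWritten 10)   -- 55 units after 10 commits
  print (totalWritten 100)  -- 5050 units: 10x the commits, ~100x the storage
```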
Another possible way to reduce database size would be to implement some compression scheme.
Cf. #146
How is this GC supposed to know that a transaction is no longer used in the snapshot at a given time? As far as I understand, certain transactions can be removed without changing the current state while others are needed for it; perhaps some measure of information containment could be tied to the transactions when they are committed?
This utility is sort of on the back burner.
The reason garbage collection would be useful now is that our file format is overly simplistic (but correct). Each transaction is written out via Binary as a completely self-sufficient unit, with no data dependencies on the previous transaction. Obviously, this is very wasteful in terms of IO and storage, so we have a new design coming down the pipeline.
I'm happy to discuss the proposed design in greater detail.
The new design should obviate or reduce the need for garbage collection though I could imagine a collection system which would aggregate old transactions into one when one no longer cares about history older than X.
@agentm Do you mean that with the previous design the entire database history was/is rewritten at every transaction? I'm not sure I understand.
That's almost right. The current, suboptimal design is that the DatabaseContext's record items are serialized via Binary to disk without referencing previous transactions. That means that every relation variable is re-serialized to disk on each commit. The benefits of this approach are simplicity and ease of validating correctness.
So, the entire database history is not rewritten, rather each new transaction's state is serialized to disk without consideration for what was already previously serialized.
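A minimal sketch of the difference, with invented names (this is not Project:M36's actual API): the current scheme serializes every relvar on each commit, while a delta scheme would write only the relvars that changed relative to the parent transaction.

```haskell
import qualified Data.Map as M

type RelVarName = String
type Context    = M.Map RelVarName String  -- relvar name -> serialized bytes

-- Current scheme: write out the whole context on every commit.
fullSnapshot :: Context -> [(RelVarName, String)]
fullSnapshot = M.toList

-- Delta scheme: write only relvars that differ from the parent's context.
delta :: Context -> Context -> [(RelVarName, String)]
delta parent child = M.toList (M.differenceWith keepChanged child parent)
  where keepChanged new old = if new == old then Nothing else Just new

main :: IO ()
main = do
  let parent = M.fromList [("p", "parts-v1"), ("s", "suppliers-v1")]
      child  = M.insert "p" "parts-v2" parent  -- only "p" changed
  print (length (fullSnapshot child))  -- 2: everything is rewritten
  print (delta parent child)           -- [("p","parts-v2")]: just the change
```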
I am currently working on the very obvious optimizations to reduce IO here.
Hi @agentm,
Do you have an idea when the new design might be implemented? Is there a ticket for it specifically? From my perspective, the current persistence strategy is the biggest issue with this project at the moment, as it not only requires excessive storage space but also has massive performance implications. I would love to use project-m36 for an upcoming product, but I'm pretty sure that longer-term performance will be unacceptable (say, with a state on the order of 100 MB to 1 GB).
@matchwood, I definitely empathize with you here. The current storage mechanism is a very basic implementation for durable writes, and there is plenty of low-hanging fruit. I am working on a storage backend using cborg, but cborg has its own set of problems.
I wish I could offer a better answer than "a few months away", but I have to contend with my daily non-Project:M36 tasks.
Please stay tuned; I think we will have some unique features in the storage space that other databases won't be able to match (such as O(1) commits), so I am excited to get them into the project.
I have started work on my own manual garbage collection utility. My use-case is for a long-running process that logs every few minutes and I was able to reduce the size from 8 MB for a few hundred entries to 80 KB.
This utility has a couple of drawbacks in its current state:
- It is not thread-safe. I still need to figure out how each thread accesses the transaction history and at what points they will create branches. Is branch creation completely manual, or is there automatic detection built into project-m36 somewhere?
- All history is removed except for the most recent branch. I would like to offer some way to supply a custom function for determining which transactions to remove or keep.
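The pluggable policy in the second point could look something like this sketch (all types here are invented for illustration and are not Project:M36's real transaction representation): the caller supplies a predicate, and the collector keeps whatever the predicate accepts plus the newest transaction, which must survive regardless of policy.

```haskell
data Transaction = Transaction { txId :: Int, txInfo :: String }
  deriving (Show, Eq)

-- Keep every transaction the policy accepts, plus the head commit.
-- Assumes a linear, oldest-first history for simplicity.
collect :: (Transaction -> Bool) -> [Transaction] -> [Transaction]
collect _    []   = []
collect keep txns = filter keep (init txns) ++ [last txns]

main :: IO ()
main = do
  let history   = [Transaction i ("commit " ++ show i) | i <- [1 .. 5]]
      survivors = collect (even . txId) history  -- policy: keep even ids
  print (map txId survivors)  -- [2,4,5]: head (5) kept despite the policy
```

A real version would have to walk the branching transaction graph rather than a flat list, keeping everything reachable from any branch head.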
Hi @limaner2002!
Is this a utility which works directly on the database files or does it run alongside the server?
One advantage of running alongside the server would be the ability to hook into STM which maintains the transaction graph. Otherwise, Project:M36 locks the transaction graph file using POSIX advisory locks. It's probably safe to truncate the past relation store files to zero as long as you don't need to time travel. That way, you wouldn't have to mess with the transaction graph at all.
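Truncating an old relation store file could be as small as this sketch, assuming the server is stopped or the advisory lock is held so no reader races with the truncation (the file name here is hypothetical):

```haskell
import System.IO

-- Truncate one relation store file to zero bytes, keeping the file in
-- place so any references to it from the transaction graph still resolve.
truncateStore :: FilePath -> IO ()
truncateStore path = withFile path ReadWriteMode (\h -> hSetFileSize h 0)

main :: IO ()
main = do
  writeFile "old-transaction.store" "stale serialized relations"
  truncateStore "old-transaction.store"
  contents <- readFile "old-transaction.store"
  print (length contents)  -- 0
```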
Also, I hope you are using a recent version of Project:M36 which uses zip compression on the stored relations. That's not a panacea for proper storage management, but it could help.
I plan to address this shortcoming with a new storage architecture which should solve this problem and set Project:M36 apart from other databases.
I guess the third drawback of my utility would be
- It works directly on the database files and not alongside the server.
I would like to get it to work alongside the server. In fact, my use case would benefit from that. I just need to do some more digging around to figure out how to do that.
I'm using the 0.6 release of Project:M36 that was released on January 9th. I believe it does have the zip compression.
I read the documentation of the new storage architecture and I am intrigued. I would like to know if there is anything I could contribute to that effort. I figure I could start out with this little utility to help myself understand how the current design works and go from there.
Shoot, sorry that the compression didn't offer a temporary solution.
If you want to integrate it into the backend directly, take a look at the STM monad functions such as commitLock_, which holds the FS lock while operating on the serialized transaction graph.
The future architecture will feature serialized graph diff thunks with simple cache management, with all requests equally likely to end up in the cache (thus, no fundamental "relation"-based storage).
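As a rough illustration of the diff-thunk idea (invented types, not the actual design): each node stores only a diff against its parent, and materializing a state means folding the diffs back to a base snapshot. A cache could then memoize materialized nodes so most lookups avoid the walk.

```haskell
import qualified Data.Map as M

type State = M.Map String Int

-- A node is either a full base snapshot or a diff applied to its parent.
data Node = Base State
          | Diff (M.Map String Int) Node

-- Rebuild a full state by walking the chain back to the base.
-- M.union is left-biased, so the newer diff wins over inherited values.
materialize :: Node -> State
materialize (Base s)        = s
materialize (Diff d parent) = M.union d (materialize parent)

main :: IO ()
main = do
  let root = Base (M.fromList [("x", 1), ("y", 2)])
      n1   = Diff (M.fromList [("y", 3)]) root  -- y changed
      n2   = Diff (M.fromList [("z", 9)]) n1    -- z added
  print (M.toList (materialize n2))  -- [("x",1),("y",3),("z",9)]
```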
This should no longer be an issue as of Project:M36 0.8 with the new backend rewrite. Read more about the new data-independent architecture here.
The naive persistence layer is gone.