Create Internal Data Structure / Cache

Open amouat opened this issue 5 years ago • 1 comments

At the moment the file system is effectively the database and source of all truth for Trow.

I think it's important that all data is captured in the filesystem, allowing Trow to be easily copied, restarted, backed up, etc. But there also needs to be an internal in-memory data structure to allow fast updates, statistics, metrics, checks, etc.
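As a rough sketch of what this could look like (Python for brevity; `RegistryIndex`, its methods, and the one-digest-per-file layout are hypothetical and not Trow's actual code), the in-memory structure can be rebuilt from disk on startup and written through on every update, so the filesystem stays the source of truth:

```python
import os


class RegistryIndex:
    """In-memory view of the manifests directory (hypothetical layout:
    one small file per tag, containing a digest string)."""

    def __init__(self, root):
        self.root = root
        self.manifests = {}
        self.reload()

    def reload(self):
        """Rebuild the index from disk, e.g. on startup."""
        self.manifests = {}
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                path = os.path.join(dirpath, name)
                key = os.path.relpath(path, self.root)
                with open(path, "r", encoding="utf-8") as f:
                    self.manifests[key] = f.read().strip()

    def put(self, key, digest):
        """Write through: persist to disk first, then update memory."""
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(digest)
        self.manifests[key] = digest

    def count(self):
        """Cheap statistic served from memory, without touching the disk."""
        return len(self.manifests)
```

Because every `put` hits the disk before memory, copying or restoring the directory and calling `reload()` should give back the same index.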

This was discussed a lot in the early days of Trow, when the overall design was being worked out. The hope was that we would be able to use CRDTs to support a distributed Trow network with multiple Trow backend instances. Each instance may have a different set of images and would need to coordinate with peers, hence the CRDT. It's important to keep such use cases in mind as we build out the data structure, so that it remains possible to implement them. The CRDT may or may not be part of the internal data structure, but I think it's relevant to think about here.
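To make the CRDT idea concrete, here is a minimal sketch of the simplest one, a grow-only set, applied to image digests. This is an illustration only: `GSet` is a hypothetical name, and which CRDT (if any) would suit Trow is left open above. A grow-only set cannot express deletions, so a real design would likely need something richer.

```python
class GSet:
    """Grow-only set CRDT. Merge is set union, which is commutative,
    associative, and idempotent, so peers converge regardless of the
    order in which they exchange state."""

    def __init__(self, items=()):
        self.items = frozenset(items)

    def add(self, item):
        """Return a new set that also contains `item`."""
        return GSet(self.items | {item})

    def merge(self, other):
        """Combine local state with a peer's state."""
        return GSet(self.items | other.items)
```

Two instances holding different images can merge in either direction and end up with the same set of digests.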

Another thought I had was that the contents of the registry can be expressed as a SHA of the manifests directory. Each subdir should have its own SHA, and the manifests directory then has a SHA of its subdirs (I later realised I was describing a sort of Merkle tree). Keeping these checksums allows us to verify that state hasn't changed unexpectedly. For example, the checksum could be written out on registry shutdown and used to verify contents on restart (or there could be random/requested verification runs).
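A minimal sketch of that directory checksum, assuming SHA-256 and an encoding of my own choosing (the `dir_digest` and `file_digest` helpers are hypothetical, not existing Trow functions). Each directory's digest is computed over the sorted names and digests of its children, so a change anywhere below changes the root digest:

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_digest(path):
    """Merkle-style digest: hash over sorted (kind, name, child digest)
    entries, recursing into subdirectories."""
    h = hashlib.sha256()
    for name in sorted(os.listdir(path)):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            kind, digest = "d", dir_digest(child)
        else:
            kind, digest = "f", file_digest(child)
        h.update(f"{kind}\0{name}\0{digest}\n".encode("utf-8"))
    return h.hexdigest()
```

Sorting the entries makes the result independent of the order in which the filesystem lists them. Storing the per-subdir digests would also allow a mismatch to be narrowed down to the subtree that changed.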

The blobs directory should also have a SHA representing all the contents. We can then do the following:

- get all blob SHAs referenced from _manifests_ and munge them into a single SHA ("all manifest digest") somehow (this would require a fixed ordering)
- get all SHAs from the _blob folder_ and munge them into a single SHA ("all blob digest")
- if the two match, we know the registry is in sync (no dangling blobs)
- we should also verify that each blob's content matches its SHA
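The steps above could be sketched roughly as follows. This assumes blobs are stored as files named by the hex SHA-256 of their content and that the digests referenced by manifests have already been extracted into a list; both are assumptions for illustration, and all function names are hypothetical.

```python
import hashlib
import os


def combined_digest(digests):
    """Fold many digests into one. Sorting gives the fixed ordering;
    de-duplicating handles blobs shared by several manifests."""
    h = hashlib.sha256()
    for d in sorted(set(digests)):
        h.update(d.encode("utf-8") + b"\n")
    return h.hexdigest()


def in_sync(referenced_digests, blobs_dir):
    """True if the blobs on disk are exactly the blobs the manifests reference."""
    on_disk = os.listdir(blobs_dir)
    return combined_digest(referenced_digests) == combined_digest(on_disk)


def corrupt_blobs(blobs_dir):
    """Names of blobs whose content does not hash to their file name."""
    bad = []
    for name in sorted(os.listdir(blobs_dir)):
        with open(os.path.join(blobs_dir, name), "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != name:
            bad.append(name)
    return bad
```

Note that a mismatch between the two combined digests only says that something differs; finding the dangling or missing blob still requires comparing the two sets directly.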

The SHAs should be returned from an endpoint as the current repo state. A verify command could be triggered via the API or on startup.

amouat • Aug 21 '20 09:08

Also see the mini-trow approach, which puts a full JSON object in files rather than just a digest pointer.

amouat • Dec 07 '20 16:12