RFC: Partial storage locking to increase 'storing signature' performance
This is more of a rough idea, and I'm wondering if you have any thoughts on it as well. It relates somewhat to https://github.com/containers/storage/issues/420
We're currently locking the whole containers, images, and layers directories when doing write operations on the store. I think this could be improved, at the cost of a more complex locking mechanism.
The idea is to not lock the whole layer store, but only the layers that belong to a certain image. This would allow storing two images in parallel if they do not share any layers.
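For illustration, a minimal Go sketch of the idea, assuming a hypothetical per-layer lock directory; the paths, names, and use of flock(2) are illustrative, not an actual implementation:

```go
// Hypothetical sketch: lock only the layers that belong to the image being
// stored instead of the whole layer store. Paths and names are illustrative.
package locking

import (
	"os"
	"path/filepath"
	"sort"
	"syscall"
)

// lockLayers takes one exclusive flock per layer ID. Sorting the IDs gives a
// stable acquisition order, so two writers that share layers cannot deadlock
// against each other; writers with disjoint layer sets proceed in parallel.
func lockLayers(lockDir string, layerIDs []string) ([]*os.File, error) {
	ids := append([]string(nil), layerIDs...)
	sort.Strings(ids)

	var held []*os.File
	for _, id := range ids {
		f, err := os.OpenFile(filepath.Join(lockDir, id+".lock"), os.O_CREATE|os.O_RDWR, 0o644)
		if err != nil {
			unlockLayers(held)
			return nil, err
		}
		if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
			f.Close()
			unlockLayers(held)
			return nil, err
		}
		held = append(held, f)
	}
	return held, nil
}

// unlockLayers releases and closes all held layer locks.
func unlockLayers(held []*os.File) {
	for _, f := range held {
		syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		f.Close()
	}
}
```

Two images that share no layers would then never contend on the same lock files.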
WDYT?
@nalind, WDYT?
That's roughly related to a PR over at c/image (https://github.com/containers/image/pull/611) which does per-layer locking to avoid redundantly downloading a given blob. That PR isn't merged, and I doubt it ever will be, but the conclusion regarding the layer locks is that they are fine as long as they reside on a tmpfs (the number of lock files could grow large over time).
My 2 cents: fine-grained locking can be really tricky to implement (more critical sections to lock and get wrong) and can sometimes decrease performance (more locks need to be acquired, scheduling can trash caches, etc.). But I'm convinced it would speed things up here. One thing we need to check is the risk of running into ABBA deadlocks (in case of multi-writer copy dependencies).
Be careful about creating too many files on tmpfs; I think we found a cache leak in the kernel related to the number of files left on tmpfs. @mheon would know the specifics.
Ouch, yes that'll hurt. @rhatdan, any other way to do that? We need some way to clean them up.
@vrothberg memory based locking?
@rhatdan sounds complex in shared memory but possible.
One thing the afternoon coffee made me think about is how we could migrate. Assuming we have fine-grained locking in place, how could we remain backward compatible? Other tools with older versions of c/storage might be running in parallel.
Well, they would still be taking the higher-level locks, so they could block the newer tools. The newer tools could not block the older ones.
I guess it would be best to have a flag that would block the older tools from running. But I am not sure we can do this.
New versions could take the big lock as readers; this way, old versions would block trying to acquire it as writers.
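For illustration, a minimal sketch of that scheme; the lock-file path and the use of flock(2) are stand-ins, not how c/storage actually implements the store-wide lock:

```go
// Sketch only: new versions take the store-wide lock shared and then take
// fine-grained per-layer locks; old versions still take it exclusive, so
// they simply block until new-style writers are done.
package locking

import (
	"os"
	"syscall"
)

// withBigLockShared runs fn while holding the store-wide lock as a reader.
func withBigLockShared(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_SH); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn() // take the fine-grained per-layer locks inside fn
}
```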
Sounds like a plan; I would still keep the locks close to the storage somehow. Would something like bbolt work for this use case?
@mheon will know. It's crucial that the locks are non-persistent (in contrast to the containers, image, layer databases etc.). Is that possible with bolt?
As long as the DB lives on a tmpfs, sure - that'll wipe the DB on reboot. If you need something even less persistent than that, you can close your DB connection, remove the DB, and reopen, and that ought to resolve things.
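For completeness, a rough sketch of that "close, remove, reopen" reset using go.etcd.io/bbolt; the DB path (assumed to live on a tmpfs) and the helper name are made up:

```go
// Sketch: close and drop any existing lock DB, then open a fresh one. With
// the DB on a tmpfs a reboot wipes it anyway; this forces the same effect
// on demand.
package locking

import (
	"os"

	bolt "go.etcd.io/bbolt"
)

func resetLockDB(old *bolt.DB, path string) (*bolt.DB, error) {
	if old != nil {
		if err := old.Close(); err != nil {
			return nil, err
		}
	}
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return nil, err
	}
	return bolt.Open(path, 0o600, nil)
}
```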
I think we should still store the (let me call it) lock DB on ~~disk~~ a filesystem to allow multiple processes (buildah, podman) to see if something is locked or not, right?
Yes, I believe so.
Hmmm. So in that case, you'd want to detect reboots, and then "unlock" all the locks?
Libpod does something similar (detect a reboot via a file in tmpfs disappearing, and then "refresh" the DB by clearing nonpersistent state). That might work?
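A rough sketch of that refresh pattern; the sentinel path and the clearState callback are hypothetical:

```go
// Sketch: the sentinel lives on a tmpfs, so if it is missing we must have
// rebooted and can safely clear all non-persistent (lock) state.
package locking

import "os"

func refreshAfterReboot(sentinelPath string, clearState func() error) error {
	if _, err := os.Stat(sentinelPath); err == nil {
		return nil // sentinel still present: no reboot since the last refresh
	} else if !os.IsNotExist(err) {
		return err
	}
	if err := clearState(); err != nil { // "unlock" everything / clear the DB
		return err
	}
	f, err := os.Create(sentinelPath)
	if err != nil {
		return err
	}
	return f.Close()
}
```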
@mheon, I'd just put the entire DB on a tmpfs.
I agree that putting the DB on tmpfs would work. containers/storage already relies on tmpfs for cleaning up mount counts.
https://github.com/containers/image/pull/611 also places the layer locks on the storage tmpfs.
So if we speak in terms of boltdb, the database would have three buckets for layers, containers, and images, where the entries indicate (per ID) whether it's a read or a write lock, right?
I am not actually sure how we could implement cross-process locking with bolt. We would still need file locks, which makes me believe we can follow the idea of containers/image#611: put them in the storage's tmpfs and reference them by convention (e.g., layers in ./layer-locks, images in ./image-locks, etc.).
That also sounds good; some additional questions from my side:
- Which file locks would we need additionally when trying to go via the database?
- What could the convention look like? Are we storing the locks per line or in some serialized hash map format?
- Would it be faster than a key value store?
* Which file locks would we need additionally when trying to go via the database?
I assume we would need one lock to serialize accesses to the database (which would turn into a bottleneck). I don't know how we would implement a synchronization mechanism in a database without reimplementing a mutex. That's why I lean toward just using file locking.
There is an open PR against storage to add a flock package (https://github.com/containers/storage/pull/378), which allows for try-lock semantics that might come in handy at some point. @rhatdan, maybe we should revive that PR?
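To illustrate what try-lock semantics buy us; this is not the API of the package proposed in containers/storage#378, just flock(2) with LOCK_NB:

```go
// Sketch: attempt to take an exclusive lock without blocking; returns false
// if another process already holds it.
package locking

import "syscall"

func tryLock(fd int) (bool, error) {
	err := syscall.Flock(fd, syscall.LOCK_EX|syscall.LOCK_NB)
	if err == syscall.EWOULDBLOCK {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```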
* What could the _convention_ look like? Are we storing the locks per line or in some serialized hash map format?
In containers/image#611 we create file locks in a specific folder. The file names are the digests of the blobs. We can do the same for containers, images, and layers if we want to.
If we want to acquire a layer as a reader, we would do an RLock() on the lock file (e.g., /storage-tmps/$storage-driver/layer-lock/$digest-of-layer.lock).
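A rough sketch of that convention; the base path, helper name, and use of flock(2) for the shared lock are made up for illustration:

```go
// Sketch: take a shared ("reader") lock on a per-layer lock file named after
// the layer digest, e.g. <base>/<driver>/layer-locks/<digest>.lock.
package locking

import (
	"os"
	"path/filepath"
	"syscall"
)

func rlockLayer(base, driver, digest string) (*os.File, error) {
	dir := filepath.Join(base, driver, "layer-locks")
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return nil, err
	}
	f, err := os.OpenFile(filepath.Join(dir, digest+".lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_SH); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil // caller releases with LOCK_UN and closes the file
}
```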
* Would it be faster than a key value store?
I bet that file locking performs better, as we don't need to load the entire DB in each process but only take the lock we actually need. File locks are real synchronization primitives with proper scheduling, signaling, etc. If we tried to emulate them with a DB, we'd run into all kinds of issues regarding fairness, scheduling, signaling, and so on.
@saschagrunert @vrothberg Still something we want to pursue?
Yes, but we're still waiting for feedback from @nalind :)
Yes, definitely worth pursuing. Layer-locking is something we might need to generalize to cover other scenarios as well (e.g., pull-delete races, simultaneous pulls, etc.).
Also, this issue is still waiting for #473, correct?
Basically, yes. Technically the features could live on their own, but I see #473 as a preliminary step for this enhancement.