
Repository locking 🔒

Open dbnicholson opened this issue 6 years ago • 20 comments

In #813, some simple repository locking was being considered to prevent prunes from running concurrently with checkouts and transactions. While I think that would be a great thing, I don't think that level of locking is rich enough for ostree. Since the locking isn't recursive, it would prevent both libostree and the ostree programs from taking an exclusive lock whenever a lock is also taken somewhere else in libostree. The simplest example I can think of is ostree_repo_prune_static_deltas(): you'd want to take an exclusive lock there as an entry point to libostree, but it's also called from ostree_repo_prune(), which you'd also want an exclusive lock for. That would simply deadlock.

The alternative I came up with here is to maintain the lock in thread local storage with a stack of lock states. Then each thread locks independently and the stack stays coherent. 8c66c5adecd504844d8ed9e2d866e7573773cf10 has that implementation with a large comment in ostree-repo.c describing how it works. The API is then a push/pop interface.
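
To make the push/pop interface concrete, here is a minimal usage sketch. The function names, signatures, and the lock-type constant are assumptions based on the description above, not the final API.

/* Sketch only: the locking calls below are assumed from the description
 * above, not taken from a merged API. */
static gboolean
prune_deltas_with_exclusive_lock (OstreeRepo *repo,
                                  GCancellable *cancellable,
                                  GError **error)
{
  /* Pushing nests: if a caller higher up the stack already holds a lock
   * on this thread, this grows the thread-local lock-state stack instead
   * of deadlocking. */
  if (!ostree_repo_lock_push (repo, OSTREE_REPO_LOCK_EXCLUSIVE, cancellable, error))
    return FALSE;

  gboolean ret = ostree_repo_prune_static_deltas (repo, NULL, cancellable, error);

  /* Pop regardless of the result; pass NULL on failure so a pop error
   * can't clobber one already set by the prune. */
  if (!ostree_repo_lock_pop (repo, cancellable, ret ? error : NULL))
    ret = FALSE;

  return ret;
}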

The drawback of having the lock in TLS is that it seems to prevent taking the lock asynchronously since that would likely involve having another thread acquire the lock. Once that thread completed, the lock would be dropped. So, the implementation here blocks synchronously when pushing or popping the lock. The lock timeout can be managed globally with a repo config option or from an API.
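
For illustration, such a global setting might sit next to the other [core] keys; the key name here is made up for the sketch rather than an actual option:

[core]
lock-timeout=30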

All the new APIs are experimental, but I think it would be best to make them public so ostree programs can do their own locking around a series of ostree operations.

I added a bunch of locking in places I thought it would be needed. One spot I didn't try to tackle was the core and remote config. Thinking about all the use cases got me confused, so I punted for now...

This closes https://bugzilla.gnome.org/show_bug.cgi?id=759442.

Let me know what you think! I know there's a lot here, but this issue has affected us quite a bit at Endless.

dbnicholson avatar Oct 18 '17 17:10 dbnicholson

High level looks sane at a brief glance, but there are lots of details here. (One question I have is how this intersects with the worker threads we create internally.)

How about we avoid lots of #ifdefs here by having a private version that's used internally, then the "experimental" bit is just exposing it publicly? The private version could actually be a repo setting too so it doesn't depend on the experimental-API build?

cgwalters avatar Oct 18 '17 18:10 cgwalters

High level looks sane at a brief glance, but there are lots of details here. (One question I have is how this intersects with the worker threads we create internally.)

Yeah, I tried to think about that a lot, but I couldn't totally come to grips with it. Basically, the worker threads are OK if you only ever acquire a shared lock. Which (I believe) is the case now. Otherwise, you have to arrange to take the lock before the worker is spawned and ensure that the worker doesn't go through a path that acquires locks.

I spent a lot of time thinking about how to address that. I had another version that tried to detect whether the lock was being acquired in the same thread by stashing the main context in use when the repo was opened and skipping the lock if the main context in use when the lock was acquired wasn't the same. But that only tells you if the context is the same, not if the thread is the same. (I might be missing something there. I am definitely not the GMainContext guru.)

The only other thing I could think of was having a dedicated locking thread, but I then didn't know how to address the lock stack being updated out of order by multiple threads.

How about we avoid lots of #ifdefs here by having a private version that's used internally, then the "experimental" bit is just exposing it publicly?

You mean in src/ostree? I could add a cmdprivate pointer if that's better. Most of src/libostree is clean except for the headers.

The private version could actually be a repo setting too so it doesn't depend on the experimental-API build?

Not sure I follow here. What would the setting be?

dbnicholson avatar Oct 18 '17 19:10 dbnicholson

Yeah, I mean the binaries would use a cmdprivate API to do locking, which we could then enable by setting e.g.

[core]
locking=true

or so in the repo config (which itself we could label as experimental, I guess). That'd allow us to land most of these patches and gain experience with it a bit, even on builds that don't use --enable-experimental-api.
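
As a rough sketch of that split, assuming the existing cmdprivate mechanism (a private vtable that libostree exports only to the ostree binaries); the type and member names below are illustrative, not a real interface:

/* Hypothetical vtable members the binaries would call for locking.
 * Whether the calls actually take the flock would then be governed by
 * the proposed [core] locking= repo key. */
typedef struct {
  gboolean (*repo_lock_push) (OstreeRepo *repo, gboolean exclusive,
                              GCancellable *cancellable, GError **error);
  gboolean (*repo_lock_pop)  (OstreeRepo *repo,
                              GCancellable *cancellable, GError **error);
} OstreeCmdLockVTable;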

cgwalters avatar Oct 18 '17 19:10 cgwalters

Oh, well I only added actual locking in fsck. The rest of the stuff in src/ostree is just turning the --lock-timeout option into a call to ostree_repo_set_lock_timeout(), which could be cmdprivate like you say.

However, the repo config option could be used to make ostree_repo_lock_push() and ostree_repo_lock_pop() into noops. That's a good idea in case this is broken...
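
A minimal sketch of that opt-out, assuming a [core] locking= key (the key name and the helper are illustrative): the push/pop entry points would call something like this and return early when it says locking is off.

/* Sketch: read the proposed [core] locking key so push/pop can become
 * no-ops when locking is disabled.  Defaults to enabled if the key is
 * absent or unparsable. */
static gboolean
repo_locking_enabled (OstreeRepo *repo)
{
  GKeyFile *config = ostree_repo_get_config (repo);
  g_autoptr(GError) local_error = NULL;
  gboolean enabled = g_key_file_get_boolean (config, "core", "locking", &local_error);
  if (local_error != NULL)
    return TRUE;
  return enabled;
}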

I also see that the CI failed immediately because I haven't actually tested without --enable-experimental-api in a while. Whoops.

dbnicholson avatar Oct 18 '17 19:10 dbnicholson

Yeah...we should probably try to move fsck into the library too, but that doesn't need to block this patch.

Let's try to get this cleaned up and then I'm happy to merge it and we can do followup commits?

cgwalters avatar Oct 19 '17 16:10 cgwalters

@cgwalters Had another thought about the threads.

  • If the lock to be taken is shared, only do it when outside a transaction. Assuming the transaction is started on a thread that lives throughout the life of the transaction, you already hold a shared lock.

  • If the lock to be taken is exclusive, add internal _unlocked versions to be called anywhere that can run from a worker thread. So far I've only encountered ostree_repo_delete_object removing tombstones when writing out commits.
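
As a sketch of the _unlocked split (the locking calls are assumed, as above): the public entry point takes the lock, while the internal variant does the actual work and is what code already running under a lock, such as a worker thread spawned by the lock holder, would call.

/* Sketch only: the locking calls are assumed, not a merged API. */
static gboolean
delete_object_unlocked (OstreeRepo *repo, OstreeObjectType objtype,
                        const char *checksum,
                        GCancellable *cancellable, GError **error)
{
  /* ...the actual deletion, with no locking of its own... */
  return TRUE;
}

gboolean
ostree_repo_delete_object (OstreeRepo *repo, OstreeObjectType objtype,
                           const char *checksum,
                           GCancellable *cancellable, GError **error)
{
  if (!ostree_repo_lock_push (repo, OSTREE_REPO_LOCK_EXCLUSIVE, cancellable, error))
    return FALSE;

  gboolean ret = delete_object_unlocked (repo, objtype, checksum, cancellable, error);

  /* Pass NULL on failure so a pop error can't clobber one already set. */
  if (!ostree_repo_lock_pop (repo, cancellable, ret ? error : NULL))
    ret = FALSE;

  return ret;
}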

I have a bunch of fixups that I'll push out shortly that should address the comments so far.

dbnicholson avatar Oct 19 '17 17:10 dbnicholson

☂️ The latest upstream changes (presumably 2531d8f) made this pull request unmergeable. Please resolve the merge conflicts.

rh-atomic-bot avatar Oct 19 '17 19:10 rh-atomic-bot

OK, I added a bunch of fixups that I'll squash soon, but I wanted to give anyone interested a chance to see the intermediate state. I believe I addressed most of the comments so far.

TODO:

  • Add the core.locking option to globally enable or disable locking
  • Investigate locking from worker threads more
  • Look at locking for core and remote config

dbnicholson avatar Oct 19 '17 19:10 dbnicholson

☂️ The latest upstream changes (presumably 9166605) made this pull request unmergeable. Please resolve the merge conflicts.

rh-atomic-bot avatar Oct 20 '17 13:10 rh-atomic-bot

Are you going to have a chance to rebase this soon? If not I might be able to do it early this week.

cgwalters avatar Oct 23 '17 13:10 cgwalters

Yeah, I'll rebase later today. Some of the commits should be reordered a bit.

dbnicholson avatar Oct 23 '17 14:10 dbnicholson

☂️ The latest upstream changes (presumably 519b30b) made this pull request unmergeable. Please resolve the merge conflicts.

rh-atomic-bot avatar Nov 03 '17 22:11 rh-atomic-bot

Would be happy to start trying to take this - perhaps it's simplest to split this up into a PR that adds the internal locking API with just a few uses - maybe just prune and commit? Then we can gradually trickle in changes to enable it in different places like fsck as we test?

cgwalters avatar Nov 15 '17 14:11 cgwalters

Sorry, got busy with a bunch of other stuff but I should be getting back to this today or tomorrow. I'll split it up into 2 PRs like you say. Some other stuff I've thought of in the meantime:

  • A repo-wide locking enable boolean. I hope this wouldn't be needed much, but it would allow programs to opt out of locking if we run into issues with this or if there's a sequence that's just too complex for the internal locking.

  • A thread-specific locking enable boolean. This is my idea for handling the worker threads. What you'd do is take a lock, spawn the thread, and immediately disable locking for that thread. When the thread completed, the spawner would pop the lock. That would allow (for example) all the pull threads to run without locking while the thread spawner has the repo locked, and avoid the issues with the lock stack being updated by multiple threads.
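
A rough sketch of that idea using a GLib thread-local flag (everything here is illustrative): the spawner takes the real lock, the worker sets the flag as soon as it starts, and the lock push/pop entry points check it and return early.

/* Sketch only: per-thread flag that turns lock push/pop into no-ops for
 * worker threads whose spawner already holds the repo lock. */
static GPrivate locking_disabled = G_PRIVATE_INIT (NULL);

static void
repo_thread_disable_locking (void)
{
  g_private_set (&locking_disabled, GINT_TO_POINTER (1));
}

static gboolean
repo_thread_locking_disabled (void)
{
  return g_private_get (&locking_disabled) != NULL;
}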

dbnicholson avatar Nov 15 '17 17:11 dbnicholson

I opened #1343 with the slimmed-down initial setup as requested. It still passes the test suite, although I haven't really used it in real applications.

dbnicholson avatar Nov 15 '17 23:11 dbnicholson

How do you want to proceed with rebasing this? Are you planning to work on it anytime soon?

cgwalters avatar Jan 08 '18 18:01 cgwalters

Sorry, I had to take off my ostree developer hat for a while. Next week I'll get back on this and flesh it out. Does that timeline work or were you looking for something sooner?

dbnicholson avatar Jan 08 '18 20:01 dbnicholson

No pressure on the timeline! Currently locking is disabled by default. I am happy to help out a bit if we can figure out how to subdivide work. Just trying to keep things moving!

cgwalters avatar Jan 08 '18 20:01 cgwalters

@dbnicholson: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot avatar Apr 12 '21 21:04 openshift-ci-robot

@dbnicholson: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/sanity
Commit: 8c9f51816c4e5dcffe66b07e3f2d6cf7e3ff803e
Details: link
Required: true
Rerun command: /test sanity

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Apr 06 '22 20:04 openshift-ci[bot]