Subscriptions and the Pit of Success

Open · D1plo1d opened this issue on Sep 03 '20 · 10 comments

After looking into #1152, I realized that even with that Sled issue resolved, my code had, through mistakes of my own, fallen into what I'm going to call my Pit of Deadlocks.

The Pit of Success is an old favourite blog post of mine, and I wanted to look at subscriptions through that lens. That way I hope to explore API ideas that might lead Sled users toward better subscription programming decisions by default and, ideally, avoid deadlocks altogether.

My own personal Pit of Deadlocks

Any loop that follows a subscription -> IO -> update pattern is susceptible to self-deadlocking if the IO is slow enough that the subscription's buffer fills before the insert:

let mut subscription = db.watch_prefix("my_items");
// `Subscriber` implements `Future<Output = Option<Event>>`.
while let Some(event) = (&mut subscription).await {
  if let sled::Event::Insert { key, value } = event {
    if &value[..] != b"io_completed" {
      perform_io().await;
      db.insert(key, "io_completed".as_bytes())?; // Deadlocked here
    }
  }
}

The Pit of Success

My ideal subscriptions API would not deadlock in these scenarios. I'm in no position to implement most of this but I wanted to share my thoughts in the hope that it might be of some help to Sled. So what might this "Pit of Success" API look like?

Unbounded Subscriptions

As suggested by @spacejam, we could mitigate the deadlocking problem by throwing RAM at it. Despite the higher risk of unbounded memory growth, I think this is the quickest way to resolve my deadlocks, and it could be a useful escape hatch in general. It is the quick fix, though, not the ideal one.

Tree::watch_prefix_filter(key, filter_fn)

In my pit-of-deadlocks example I am only interested in items that are not "io_completed", and this reflects my real-world usage of subscriptions. If Tree had a method to filter events by their entry's content before they were added to the subscription buffer, then my example could not deadlock, because I could write instead:

let subscription = db.watch_prefix_filter("my_items", |entry| entry != Some("io_completed"));

This would be my ideal solution.
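
To make that a bit more concrete, the signature might look something like this inside Sled. This is purely a sketch of the proposed API (no watch_prefix_filter exists today); the important property is that the filter runs on the writer's side, before the event is pushed into the subscriber's bounded buffer, so filtered-out events never occupy a slot:

impl Tree {
  // Hypothetical: like watch_prefix, but writers call `filter(&event)` first
  // and skip this subscriber entirely when the filter returns false.
  pub fn watch_prefix_filter<P, F>(&self, prefix: P, filter: F) -> Subscriber
  where
    P: AsRef<[u8]>,
    F: Fn(&Event) -> bool + Send + Sync + 'static,
  {
    unimplemented!()
  }
}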

Turn subscription deadlocks into a Transaction::commit error

We have three options for handling a full subscription buffer that blocks for a long time:

  1. Block transaction commits like we do now
  2. Error the subscriber after a timeout
  3. Error the committer after a timeout

Erroring the subscriber means complicating the API by wrapping events in results.

Transaction errors, however, are already handled gracefully in many places. The pit-of-deadlock scenarios in my code already handle transaction errors by restarting the loop, and if necessary they could check for missed events at the start of each iteration.

Implementing this would require changing subscribers to reserve a buffer slot at the start of a commit and then send the event into that reserved slot after the commit succeeds. This feels like a larger, more difficult change, but it would prevent these deadlocks from grinding processes to a halt.
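
To illustrate the reservation idea, here is a minimal sketch (illustrative only, not sled's actual subscriber buffer): the commit path claims a slot up front and fails fast if none is available, so the writer can return a "subscriber is lagging" error instead of blocking, and the slot is only filled after the commit succeeds.

use std::collections::VecDeque;
use std::sync::Mutex;

// Illustrative only: a bounded buffer with reserve-then-fill semantics.
struct SubscriberBuffer<T> {
  state: Mutex<(VecDeque<T>, usize)>, // (queued events, reserved slots)
  capacity: usize,
}

struct Reservation;

impl<T> SubscriberBuffer<T> {
  // Called at the start of a commit: claim a slot, or fail immediately so
  // the commit can be turned into an error instead of blocking the writer.
  fn try_reserve(&self) -> Result<Reservation, ()> {
    let mut state = self.state.lock().unwrap();
    let (queue, reserved) = &mut *state;
    if queue.len() + *reserved >= self.capacity {
      Err(())
    } else {
      *reserved += 1;
      Ok(Reservation)
    }
  }

  // Called after the commit succeeds: fill the slot reserved earlier.
  // (A real implementation would also release the reservation on abort.)
  fn fill(&self, _reservation: Reservation, event: T) {
    let mut state = self.state.lock().unwrap();
    let (queue, reserved) = &mut *state;
    *reserved -= 1;
    queue.push_back(event);
  }
}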

Interactions between Async and Subscription Deadlocks

Finally, I want to note that when run on an async runtime with a limited threadpool, these subscription deadlocks can lock up (and in my case, have locked up) the entire threadpool, as each thread sooner or later hits a transaction that pushes to the deadlocked subscription. This makes subscription deadlocks an even bigger concern for async applications.

D1plo1d · Sep 03 '20 15:09

Well, the code should not deadlock anymore, but I don't think it is really reasonable to expect the API to save people from themselves in this case. The send will now error out as it should and the message will be skipped, but that also means that a queue filling to that extent is likely still a major semantic error on the consumer side, since the expectation might be that all messages are actually observed, even if backup occurs. In my opinion the responsibility is on the consumer to deal with backpressure in a way that is logically consistent given their constraints.

You are of course free to use your own unbounded queue to buffer messages, but that is just never a good policy. It probably would be a good idea to explicitly spell out somewhere, though, that you should try not to block on the receiving side if you want to make sure you receive all messages. At the end of the day, if you cannot churn through messages fast enough given the external constraints, you have to start dropping them eventually, and that case always has to be handled client side anyway. Making the producing side configurable adds complexity that can easily be shifted to the consumer side in a more modular and more flexible fashion.

soro · Sep 03 '20 16:09

At the end of the day though, if you cannot churn through messages fast enough given the external constraints, you have to start dropping them eventually and that case always has to be handled client side anyway.

This is true, but it does not describe the problem I am trying to solve. In my ideal solution I am proposing a filter that runs before the event is added to the subscription buffer, specifically to resolve deadlocks. I can process all of my events as long as I do not deadlock, and a full buffer would exert back-pressure appropriately for me if it were not also blocking the subscription task itself.

Making the producing side configurable adds complexity that can easily be shifted to the consumer side in a more modular and more flexible fashion.

The Sled-internal implementation can be run from the thread adding subscriber events. The external equivalent would require one extra thread per subscriber to filter messages and re-buffer them for their consumer. Alternatively, to make an external equivalent possible, a trait could be added that allows custom Subscriber types to be implemented outside of Sled.
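
For concreteness, the external version looks roughly like this. It's only a sketch (the filtered_subscription helper and the hard-coded filter are mine, not part of Sled), but it shows the extra thread draining the bounded Subscriber as fast as it can, applying the filter, and re-buffering the remaining events into an unbounded channel so that slow consumer IO can no longer back up writers:

use std::sync::mpsc;
use std::thread;

// Sketch of the external workaround: one extra thread per subscriber.
fn filtered_subscription(tree: &sled::Tree, prefix: &str) -> mpsc::Receiver<sled::Event> {
  let (tx, rx) = mpsc::channel(); // unbounded re-buffer
  let subscriber = tree.watch_prefix(prefix);
  thread::spawn(move || {
    // `Subscriber` implements `Iterator<Item = Event>`, blocking until the
    // next event; draining it here keeps sled's bounded buffer from filling.
    for event in subscriber {
      let interesting = match &event {
        sled::Event::Insert { value, .. } => &value[..] != b"io_completed",
        sled::Event::Remove { .. } => true,
      };
      if interesting && tx.send(event).is_err() {
        break; // the consumer dropped its receiver
      }
    }
  });
  rx
}

The consumer then blocks on rx.recv() and can take as long as it likes over its IO without ever back-pressuring writers, at the cost of unbounded memory if it never catches up.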

I think it might help to give specific use cases here:

On-device IoT DB

In my usage, Sled is a low-throughput on-device database. Being able to process events, even if processing is slowed to the speed of my IO, is enough to meet all of my needs. The primary reason I use Sled is not performance but type-safe persistence, transactions, and subscriptions.

The deadlocks are a serious concern to me, as I suspect they would be for any database user, but I do not need to perfectly optimize my transaction processing on a first pass. In my opinion it would be a waste of my limited one-developer "team" to have to optimize that aspect of my application just to take advantage of the benefits of Sled subscriptions.

Early Stage Startup DB

Nor should an early-stage cloud-based startup need to optimize this. They may well need to later on (if the startup survives long enough, that is), but optimizing early burns much-needed time in these scenarios. It would be good to enable developers to write imperfect code that does not deadlock and then let them optimize it as needed later on.

Edit: Added type safety to my reasons to use Sled. I've got many reasons I <3 Sled but my point is that performance isn't top of that list.

D1plo1d · Sep 03 '20 20:09

Well, hopefully the deadlock is actually fixed by my patch, so that concern might be addressed. If you are fine with messages not being sent on the sled side if things back up, I'd say the expected behavior should be exactly what you want?

soro · Sep 04 '20 11:09

Maybe I'm confused then; I didn't think your patch solved the user-created deadlock in this issue. In the meantime I've put together an external workaround for myself. It's not pretty, but it makes Sled subscriptions usable without deadlocks.

Btw, I think my deadlock scenario does not even need the IO step to deadlock and can be reduced to this (pseudocode):

let mut subscription = db.watch_prefix("my_items");
while let Some(event) = (&mut subscription).await {
  if let sled::Event::Insert { key, value } = event {
    if &value[..] != b"io_completed" {
      db.insert(key, "io_completed".as_bytes())?; // Deadlocked here
    }
  }
}

Because any task that writes to its own keys runs the risk of not being scheduled for 1024 inserts and then deadlocking itself upon resuming.

D1plo1d · Sep 04 '20 16:09

Yeah, you are right that it is too easy to deadlock that API as it is. I didn't really look at all the places this was being called and on what thread that would run, and you are right: any call to reserve or register that happens on a thread that is handling a subscription does have the potential to deadlock. I think the only real way to solve this is to not hold the locks for the duration of the send, and that means either more juggling with Arcs or simply using a concurrent hashmap for that registry. Of course the easiest fix would be to make that call to send a call to try_send. Sadly it seems there is still no real standard concurrent hashmap in Rust land.

soro · Sep 04 '20 16:09

One option I'm considering is to have write operations fail with an error that clearly specifies that subscribers are not keeping up. Because the reservation happens before any modifications actually happen to the value, it's safe to bail out early if the reservation cannot happen immediately.

spacejam · Sep 04 '20 17:09

I'm not sure that seems reasonable to me. It compromises the semantics of the database for downstream consumers of events. What if you still want to keep adding values anyway and can accept consumers not receiving messages? Extra methods also seem a bit excessive. I would say one option is to add a state field to the subscriber that is set once a write fails. That way you can handle logging etc. downstream.
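
Something along these lines, just to sketch the idea (the names are made up and this is not sled's actual subscriber):

use std::sync::atomic::{AtomicBool, Ordering};

// Sketch: the writer drops the event and flips a flag instead of failing the
// write; the consumer can check the flag and log or resynchronize.
struct SubscriberState {
  missed_events: AtomicBool,
}

impl SubscriberState {
  // Writer side: called when try_send on the bounded buffer fails.
  fn mark_lagged(&self) {
    self.missed_events.store(true, Ordering::Relaxed);
  }

  // Consumer side: check and clear the flag, e.g. before deciding whether to
  // re-scan the watched prefix for state that may have been missed.
  fn take_lagged(&self) -> bool {
    self.missed_events.swap(false, Ordering::Relaxed)
  }
}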

soro · Sep 04 '20 18:09

One option I'm considering is to have write operations fail with an error that clearly specifies that subscribers are not keeping up. Because the reservation happens before any modifications actually happen to the value, it's safe to bail out early if the reservation cannot happen immediately.

This would solve my issues, but I think we'd need the added caveat that the reservations should bail after a short timeout. Otherwise transactions that affect 1025 or more keys of a subscription will always fail.

Edit: Err, no, wait: if it's trying to reserve 1025 inserts in a 1024-slot buffer, that will never succeed atomically, timeout or not. Any alternative ideas?

D1plo1d · Sep 07 '20 17:09

Similar to #1165, I see that issue being addressable by having an Event::Batch that is sent to subscribers.
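
Roughly this shape (a hypothetical sketch, not the current Event enum): a transaction that touches many keys would produce one batched event and therefore occupy a single slot in the subscriber's buffer, so it could never exceed the buffer's capacity on its own.

use sled::IVec;

// Hypothetical standalone sketch, kept separate from sled::Event to avoid
// any confusion with the real enum.
enum BatchedEvent {
  Insert { key: IVec, value: IVec },
  Remove { key: IVec },
  // One event for a whole transaction: (key, Some(value)) for inserts,
  // (key, None) for removals.
  Batch(Vec<(IVec, Option<IVec>)>),
}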

spacejam · Sep 10 '20 14:09

If I understand correctly, subscriptions are a stream of all database edits with a given key prefix, starting at a particular point in the db log. Given that Sled is implemented as an LSM-tree, I would guess that this same edit stream is persistently available on-disk (at least until GC comes through and sweeps up unobservable pages). Is this a reasonable summary?

If the above are reasonable statements, it seems to me that the fact that subscribers use an in-memory queue is just an implementation detail, and what they're "really" reading from is the transaction log itself (there's some interpretation here of course). From that perspective when a writer encounters a subscriber with a full queue it could just notate on the subscription that the subscriber fell behind at log position N (and skip subscribers whose fell-behind-at field is already written), and if the subscriber ever catches back up again sled could transparently read back the log from disk starting at N to fulfill the subscriber's events.

You'd still have to decide how to deal with a subscriber that fell so far behind that the GC has already collected the log at N but at least this makes the subscription use a bounded amount of memory again. (Some solutions I can think of could be adding a GC 'root' for the earliest lagging subscriber so pending events are never collected, or simply emitting a "you fell too far behind" error as mentioned earlier ITT.) One benefit is that it could allow for tuning the memory usage of subscribers by setting the queue depth (even to zero!) without worrying about how it will change the semantics of writers or subscribers.
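
To sketch the writer-side bookkeeping this would imply (purely hypothetical, based on my reading above rather than on sled's actual internals):

use std::sync::mpsc::{SyncSender, TrySendError};

enum Feed {
  Live,                      // events arrive through the in-memory queue
  Behind { since_lsn: u64 }, // fell behind at log position N
}

struct LaggingSubscriber<E> {
  queue: SyncSender<E>, // bounded in-memory queue (depth could even be zero)
  feed: Feed,
}

impl<E> LaggingSubscriber<E> {
  // Writer side: never block on a slow subscriber. If its queue is full,
  // record the log position instead; a catch-up task can later replay the
  // persisted log from `since_lsn` before switching back to Live delivery.
  fn publish(&mut self, event: E, lsn: u64) {
    if let Feed::Live = self.feed {
      if let Err(TrySendError::Full(_)) = self.queue.try_send(event) {
        self.feed = Feed::Behind { since_lsn: lsn };
      }
    }
    // If already Behind, skip: the event is recoverable from the on-disk log.
  }
}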

(This comment makes some uninformed assumptions about how sled works (I'm just an observer, not a user.. yet); apologies if I missed something that invalidates it. 😅)

infogulch · Dec 18 '20 04:12