Overlord & Coordinator are able to delegate to followers.

Open didip opened this issue 3 years ago • 8 comments

Description

Coordinator/Overlord is currently a single point of failure and has a vertical scalability problem. I think it's time for them to start delegating to their followers.

There are some GET requests that can be delegated safely, in my opinion. Examples:

  • GET /druid-ext/basic-security/authentication/db/*
  • GET /druid-ext/basic-security/authorization/db/*
  • GET /druid/indexer/v1/task/*/log
  • GET /druid/indexer/v1/task/*/reports
  • GET /druid/indexer/v1/task/*/segments

This PR allows Druid to scale better horizontally.

Inside RedirectFilter, I added a condition that checks whether the request is a GET and whether its path matches certain patterns. If so, the currently running server handles the request itself instead of redirecting it to the leader.
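For illustration, a minimal sketch of that kind of check, using a hypothetical helper class (the class, method, and regexes below are illustrative only; the actual change lives in RedirectFilter and the *RedirectInfo classes listed below):

```java
import java.util.List;
import java.util.regex.Pattern;
import javax.servlet.http.HttpServletRequest;

// Hypothetical helper, not part of the actual patch.
public class DelegatableRequestMatcher
{
  // GET paths that a non-leader Coordinator/Overlord could safely serve itself.
  private static final List<Pattern> DELEGATABLE_GET_PATHS = List.of(
      Pattern.compile("/druid-ext/basic-security/authentication/db/.*"),
      Pattern.compile("/druid-ext/basic-security/authorization/db/.*"),
      Pattern.compile("/druid/indexer/v1/task/[^/]+/(log|reports|segments)")
  );

  public static boolean canServeLocally(HttpServletRequest request)
  {
    if (!"GET".equals(request.getMethod())) {
      return false;
    }
    final String path = request.getRequestURI();
    return DELEGATABLE_GET_PATHS.stream().anyMatch(p -> p.matcher(path).matches());
  }
}
```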


Key changed/added classes in this PR
  • *RedirectInfo
  • RedirectFilter

This PR has:

  • [x] been self-reviewed.
    • [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [ ] added documentation for new or modified features or behaviors.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [x] been tested in a test Druid cluster.

didip avatar Jun 16 '22 03:06 didip

Coordinator/Overlord is currently a single point of failure.

There is failover if the Coordinator or Overlord fails, so I wouldn't call it a SPOF. Maybe the issue is scalability. Currently the Coordinator/Overlord are only vertically scalable, and this patch makes certain APIs horizontally scalable. The general approach makes sense to me.

There are two things to watch out for:

  1. Stale info. Right now, since we have all API access going through the leader, we don't need to worry about stale info. But if the standby servers can also serve certain APIs, then the served info can be stale.

  2. False sense of scalability: if an API needs to use the metadata store, then scaling it out on Coordinator / Overlord does not really help, since the real bottleneck is likely to be the underlying metadata store.

I think we'll want to double check each of the APIs to ensure these two things are not going to be an issue.

I also wanted to ask: has this patch actually helped you scale out the load from certain APIs? That'd be a helpful data point when evaluating which ones we want to include.

gianm avatar Jun 21 '22 18:06 gianm

@gianm Thank you for taking the time to think about this proposal. And you are right, it's vertical scalability that's an issue. Thank you for correcting me on that.

And you are also correct in pointing out that if the metadata store is problematic, then it doesn't matter. That's true. But that's a fairly typical web service issue. Scale out the service first, then if the backing storage needs scaling, scale that next.

The one that we are actively testing in prod is the polling that hits this endpoint: /druid-ext/basic-security/authentication/db/%s/cachedSerializedUserMap.

Having the followers handle this request dropped the readTimeoutError to nothing, since they are not busy at all. We are not worried about stale BasicAuth data because we usually configure authorization when the cluster is brand new, and adding users after the fact is rare.

If you want me to reduce the endpoints to just /druid-ext/basic-security/authentication|authorization/db/*/cachedSerializedUserMap, I am ok with that as well. I just want to gauge your opinions/interests before I push this PR further.

didip avatar Jun 21 '22 18:06 didip

The one that we are actively testing in prod is the polling that hits this endpoint: /druid-ext/basic-security/authentication/db/%s/cachedSerializedUserMap.

That's good to know. Do you know if it's actually this API that is bogging down the leader (like, in a Coordinator thread dump, do you see a lot of threads doing this specific API)? Or is the leader bogged down by some other workload, and this API is collateral damage? If it's the first case, I'm surprised: it should be a really cheap API. However, if it's the second case, then it makes sense to me that scaling out is helpful.

BTW: if it is the second case, what happens when some server tries to contact the busy leader instead of one of the standby servers? Does it still get stuck?

We are not worried about stale BasicAuth data because we usually configure authorization when the cluster is brand new, and adding users after the fact is rare.

I'm wondering if this is true for everyone. It may not be. An option might be a good idea: something that allows admins to scale out this API at the cost of incurring stale reads.

I just want to gauge your opinions/interests before I push this PR further.

It's certainly good to know that this patch helped in your environment. The stuff I asked about above is to figure out why exactly it helped, and to see whether that leads to any suggested adjustments.

gianm avatar Jun 21 '22 19:06 gianm

Btw, thanks for the details!

gianm avatar Jun 21 '22 19:06 gianm

@gianm It is definitely the second case: "the leader bogged down by some other workload, and this API is collateral damage". That internal API is indeed cheap and unlikely to cause issues on its own.

Since the HTTP client simply picks one server at random, yes, sometimes requests went to the busy Coordinator and, predictably, suffered from a readTimeoutException.

How should I enable the option you described here: "I'm wondering if this is true for everyone. It may not be. An option might be a good idea: something that allows admins to scale out this API at the cost of incurring stale reads."

Should I add a boolean property that turns delegation to followers on or off, or should I add an array property where users can pass their own URL patterns to be delegated?

didip avatar Jun 21 '22 19:06 didip

Also, I bet Lookups suffer from the same problem because of their polling mechanism. I remember seeing many readTimeoutExceptions about Lookups, but we didn't test further because we don't support Lookups in our clusters, since they are still experimental.

didip avatar Jun 21 '22 20:06 didip

Since the HTTP client simply picks one server at random, yes, sometimes requests went to the busy Coordinator and, predictably, suffered from a readTimeoutException.

Ah, got it, I see. Does that cause problems?

I think we can help with that too, by adding a QoSFilter for noncritical and potentially slow Coordinator APIs. Did you notice what the other threads on the leader were doing when it was being slow? If there is some pattern here, we can implement a request cap for those APIs using JettyBindings.addQosFilter, ensuring we leave capacity for critical internal APIs.
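As a rough illustration of what that could look like (the filter path and request limit are my assumptions, and the exact JettyBindings.addQosFilter signature may differ from what is sketched here):

```java
import com.google.inject.Binder;
import com.google.inject.Module;
import org.apache.druid.server.initialization.jetty.JettyBindings;

// Sketch only: caps concurrent requests to a noncritical endpoint so capacity
// remains for critical internal APIs even when the leader is busy.
public class NoncriticalCoordinatorApiQosModule implements Module
{
  @Override
  public void configure(Binder binder)
  {
    // Path and limit are illustrative; assumed signature: (binder, pathSpec, maxRequests).
    JettyBindings.addQosFilter(binder, "/druid/coordinator/v1/metadata/*", 2);
  }
}
```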

Should I add a boolean property that turns delegation to followers on or off, or should I add an array property where users can pass their own URL patterns to be delegated?

I think on/off works better. We'd want to pre-vet the APIs as being things that theoretically make sense to serve from standby servers.
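For example, the flag could live in a small Jackson-bound config class along these lines (the class and property names here are hypothetical, not something this PR defines):

```java
import com.fasterxml.jackson.annotation.JsonProperty;

// Hypothetical config holder for the on/off switch discussed above.
public class StandbyDelegationConfig
{
  // Could be bound from a runtime property such as
  // druid.coordinator.redirect.serveEligibleGetsLocally=true (name is made up).
  @JsonProperty
  private boolean serveEligibleGetsLocally = false;

  public boolean isServeEligibleGetsLocally()
  {
    return serveEligibleGetsLocally;
  }
}
```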

gianm avatar Jun 21 '22 20:06 gianm

Ah, got it, I see. Does that cause problems?

By enabling the change in https://github.com/apache/druid/pull/12660, we no longer have any issues, because the local disk cache is always used.

I think we now see the real root cause in our clusters. When the Overlord/Coordinator face GC issues, the frequent GC pauses interrupt all of the threads, causing a lot of readTimeoutExceptions. That's why it seemed random when it happened. Thread count-wise, we have plenty on each server.

didip avatar Jun 21 '22 20:06 didip

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] avatar Jan 02 '24 00:01 github-actions[bot]

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Jan 31 '24 00:01 github-actions[bot]