bug: Loss of a Query Pod causes queries to fail with indeterminate state
### Search before asking
- [x] I had searched in the issues and found no similar issues.
### Version
v1.2.776-nightly
### What's Wrong?
Since Databend is a distributed system with query execution on multiple nodes, a loss of a node mid-query should be handled gracefully so as not to leave the database in an indeterminate state.
Currently, when a query node is lost during query execution, a "Broken Pipe" error is raised to the client. Unfortunately, it is not known what state the tables are in when that occurs.
While it might be possible to add something on the client side to try to determine the state, this would be quite complex and likely neither entirely accurate nor able to see the internal state of Databend objects at the time of failure. Since Databend is a distributed system, node failures will almost certainly occur on a regular basis in any sufficiently large deployment. Therefore, internal loss of nodes should be expected and handled transparently to outside systems/clients. We experience "Broken Pipe" errors on a daily basis.
While much less important, this bug also prevents the use of auto-scaling for the query pods. The query pods can be scaled out without error, but when scaling back in there is a high likelihood of a Broken Pipe error if the system is under any load, since query pods are terminated without regard to the queries they are actively running.
### How to Reproduce?
- Create a Databend cluster with at least two query pods running on different nodes
- Execute a long running query
- Terminate one or more of the nodes hosting the query pods
- Observe the query error, most likely a "Broken Pipe"
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
@BohuTANG, @sundy-li, any insights on this at all? This is a common failure mode for us running in k8s on GCP, as pods can be reaped, or simply OOM-killed and replaced. We can attempt to handle it application-side by essentially resending the query, but it seems like a cluster should be able to handle the loss of a member and fail over?
Since a replacement for one of the query pods may take time to schedule and become ready, perhaps the approach should be to detect the loss of a query pod and redistribute its execution to the remaining query pods. I'm not sure what the most straightforward/reliable approach would be internally within Databend.
Maybe the SQL should be retried on the client side?
> Maybe the SQL should be retried on the client side?
Ultimately that is what we do now; I added some handling in our application code to look specifically for BrokenPipe, but we never needed this when running against Postgres/Greenplum.
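For context, our application-side workaround amounts to something like the sketch below. This is a minimal illustration, not the actual code: `execute` is a stand-in for whatever driver call issues the query, and matching on the error string is an assumption about how the "Broken Pipe" surfaces to the client. Note that blindly resending is only safe for idempotent statements.

```python
import time

# Error-message markers we treat as a retryable node-loss symptom
# (assumed spellings; adjust to whatever your driver actually raises).
RETRYABLE_MARKERS = ("Broken Pipe", "BrokenPipe")

def run_with_retry(execute, sql, max_attempts=3, backoff_s=1.0):
    """Run `sql` via `execute`, retrying on broken-pipe errors.

    `execute` is a hypothetical callable wrapping the driver's query
    call. Resending the whole statement is only safe if it is
    idempotent (e.g. a SELECT, or an INSERT guarded by deduplication).
    """
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(sql)
        except Exception as err:
            if not any(m in str(err) for m in RETRYABLE_MARKERS):
                raise  # unrelated error: surface it immediately
            last_err = err
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # simple linear backoff
    raise last_err
```

This keeps the retry policy in one place instead of scattering BrokenPipe checks through the application, but it does not solve the underlying issue: the client still cannot know what state a non-idempotent statement left behind.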
We asked about retries through the databend driver nearly a year ago - https://github.com/databendlabs/bendsql/issues/481
Because clients use HTTP queries to pull data from the databend server, they may already have received partial results by the time the query pods change.
So, without state checkpoints, we can't retry the query inside the query server and still guarantee consistency.
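The consistency concern above is why a client-side restart has to discard anything it has already received. A minimal sketch of that discipline, assuming a hypothetical paged HTTP protocol where `start_query` issues the query and `fetch_page` follows a `next_uri`-style pagination link (both names are placeholders, not the real API):

```python
def fetch_all(start_query, fetch_page, max_attempts=3):
    """Drain a paged HTTP result set; on a mid-stream failure,
    throw away the partial rows and rerun the whole query.

    `start_query` and `fetch_page` are hypothetical stand-ins for
    the HTTP handshake and next-page calls of the result protocol.
    """
    last_err = None
    for _ in range(max_attempts):
        rows = []
        try:
            page = start_query()            # issue the query
            rows.extend(page["data"])
            while page.get("next_uri"):     # follow pagination links
                page = fetch_page(page["next_uri"])
                rows.extend(page["data"])
            return rows                     # only a fully drained set is returned
        except ConnectionError as err:
            last_err = err                  # partial rows are dropped here
    raise last_err
```

The key point is that `rows` is rebuilt from scratch on every attempt; appending a retried run onto a partially consumed result set could silently duplicate or skip rows, which is exactly the server-side hazard described above.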