
[Alerting/Monitoring] Add an API to get the server shutDownInProgress config and expose the info at tenant level.

Open MeihanLi opened this issue 3 years ago • 2 comments

Can we add an API to get the server shutDownInProgress config and expose the info at tenant level?

We recently saw that some upsert servers can become unresponsive and stop serving queries after a server restart. Even when the server is healthy, the shutDownInProgress flag cannot be set back to false (ready to serve queries). This caused an incident on our side, and it took us some time digging through the broker logs to find out that some servers had been unresponsive for a long time.

It would be useful to add an API to get the server shutDownInProgress config and expose the info at tenant level, so that we can set up proper alerting and avoid such incidents in the future.

Related PR: https://github.com/apache/pinot/pull/8525

MeihanLi avatar Jul 28 '22 21:07 MeihanLi

cc: @yupeng9

MeihanLi avatar Jul 28 '22 21:07 MeihanLi

We can definitely add an API to return whether a server is ready to serve queries (in addition to the shutDownInProgress flag, we might also want to track HELIX_ENABLED, which indicates whether the server is enabled). This information is stored in the InstanceConfig of the server in ZK.

Can you elaborate on exposing the info at tenant level? Do you mean returning the percentage of servers that can serve queries?
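
To make the question concrete, here is a minimal sketch of what a tenant-level readiness number could look like. It is not Pinot code: the class `TenantReadiness`, its methods, and the representation of each server's InstanceConfig as a plain `Map<String, String>` are hypothetical, the key spellings are taken from this thread rather than verified against ZK, and the defaults for missing fields (enabled, not shutting down) are assumptions.

```java
import java.util.List;
import java.util.Map;

public class TenantReadiness {
    // Field names as written in this thread; the exact keys stored in ZK are an assumption.
    static final String SHUTDOWN_IN_PROGRESS = "shutDownInProgress";
    static final String HELIX_ENABLED = "HELIX_ENABLED";

    // A server is treated as queryable when it is enabled and not shutting down.
    // Assumed defaults for missing fields: enabled = true, shutting down = false.
    static boolean isQueryable(Map<String, String> instanceConfigFields) {
        boolean shuttingDown = Boolean.parseBoolean(
            instanceConfigFields.getOrDefault(SHUTDOWN_IN_PROGRESS, "false"));
        boolean enabled = Boolean.parseBoolean(
            instanceConfigFields.getOrDefault(HELIX_ENABLED, "true"));
        return enabled && !shuttingDown;
    }

    // Percentage of servers in a tenant that can serve queries (0.0 for an empty tenant).
    static double queryablePercentage(List<Map<String, String>> tenantServers) {
        if (tenantServers.isEmpty()) {
            return 0.0;
        }
        long ready = tenantServers.stream().filter(TenantReadiness::isQueryable).count();
        return 100.0 * ready / tenantServers.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> tenant = List.of(
            Map.of(HELIX_ENABLED, "true", SHUTDOWN_IN_PROGRESS, "false"), // queryable
            Map.of(HELIX_ENABLED, "true", SHUTDOWN_IN_PROGRESS, "true"),  // stuck shutting down
            Map.of(HELIX_ENABLED, "false"),                               // disabled
            Map.of());                                                    // defaults apply
        System.out.println(queryablePercentage(tenant)); // prints 50.0
    }
}
```

An alert could then fire when this percentage drops below a chosen threshold for a tenant, which would have surfaced the stuck servers without digging through broker logs.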

Jackie-Jiang avatar Aug 01 '22 20:08 Jackie-Jiang