Fix bug when evaluating resource status during service startup check

Open itschrispeck opened this issue 7 months ago • 1 comments

We have seen servers sometimes fail to pass the service status checker until the timeout is reached, even after all segments are online/in the expected state. Logs show:

Sleep for 10000ms as service status has not turned GOOD: MultipleCallbackServiceStatusCallback:IdealStateAndCurrentStateMatchServiceStatusCallback:Helix state does not exist, waitingFor=CurrentStateMatch, resource=table_REALTIME, numResourcesLeft=2, numTotalResources=802, minStartCount=802,;IdealStateAndExternalViewMatchServiceStatusCallback:Init;;

This is caused by the check in the service status callback, which treats the table resource as STARTING whenever the external view/current state is null and the ideal state is not. That assumption is not valid: the current state can be null if the last segment on the server has been removed while the ideal state still exists. We primarily see this behavior on small tables with completed segment redistribution turned on.

The change here is meant to let the resource status return GOOD if the instance is no longer assigned any segment of a resource that it was assigned when the server first started and collected the resources to monitor.
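
To make the intended behavior concrete, here is a minimal sketch of the old and new decision logic. It is not Pinot's actual code: the class `ResourceStatusSketch`, its methods, and the representation of the ideal/current state as per-instance `segment -> state` maps are all hypothetical simplifications chosen for illustration.

```java
import java.util.Map;

/** Hypothetical sketch of the status decision; not Pinot's real ServiceStatus code. */
final class ResourceStatusSketch {
    enum Status { STARTING, GOOD }

    /**
     * Old behavior as described above: a null current state is always read as
     * "still starting" as long as the ideal state for the resource exists.
     *
     * @param idealForInstance segments the ideal state assigns to this instance
     *                         (empty when the ideal state exists but assigns nothing here)
     * @param currentForInstance current state of this instance, or null if none exists
     */
    static Status legacyEvaluate(Map<String, String> idealForInstance,
                                 Map<String, String> currentForInstance) {
        if (currentForInstance == null) {
            return Status.STARTING;
        }
        return matches(idealForInstance, currentForInstance) ? Status.GOOD : Status.STARTING;
    }

    /** Proposed behavior: an instance with no assigned segments has nothing to wait for. */
    static Status evaluate(Map<String, String> idealForInstance,
                           Map<String, String> currentForInstance) {
        if (idealForInstance.isEmpty()) {
            return Status.GOOD;
        }
        if (currentForInstance == null) {
            return Status.STARTING;
        }
        return matches(idealForInstance, currentForInstance) ? Status.GOOD : Status.STARTING;
    }

    /** True when every segment in the ideal state has reached its target state. */
    private static boolean matches(Map<String, String> ideal, Map<String, String> current) {
        for (Map.Entry<String, String> entry : ideal.entrySet()) {
            if (!entry.getValue().equals(current.get(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, the failing case from the log corresponds to an empty ideal-state slice with a null current state: `legacyEvaluate` keeps reporting STARTING until the timeout, while `evaluate` reports GOOD immediately.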

itschrispeck, Jul 04 '24 09:07