rubix
PrestoClusterManager wrongly infers master node as worker
PrestoClusterManager assumes that if the v1/node call fails, the node must be a slave. This holds in the ideal case, but in exceptional cases the call can fail on the master node too. For example, we saw a SocketTimeoutException in getNodes, which caused the master node to be inferred as a slave; it then returned an empty node list, causing queries to fail.
We should do at least the following:
- Infer the node as a worker only if v1/node returns 404
- Add retries if v1/node fails due to exceptions
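
The two points above could be sketched roughly as follows. This is not the actual PrestoClusterManager code: `NodeRoleProbe`, `StatusFetcher`, and `inferRole` are hypothetical names, the HTTP call to v1/node is abstracted behind an interface that returns the status code, and treating a 200 response as "master" is an assumption. Backoff between attempts is left out for brevity.

```java
import java.io.IOException;

class NodeRoleProbe {
    enum Role { MASTER, WORKER }

    /** Hypothetical wrapper around the HTTP call to v1/node; returns the HTTP status code. */
    interface StatusFetcher {
        int fetchStatus() throws IOException;
    }

    /**
     * Infers the node role from v1/node.
     * Only a 404 is taken as proof that the node is a worker. Exceptions
     * (e.g. SocketTimeoutException) and other status codes are retried, and
     * if every attempt fails the method throws instead of guessing "worker".
     */
    static Role inferRole(StatusFetcher fetcher, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                int status = fetcher.fetchStatus();
                if (status == 200) {
                    return Role.MASTER; // assumption: the master serves v1/node successfully
                }
                if (status == 404) {
                    return Role.WORKER; // the only response treated as "this is a worker"
                }
                lastFailure = new IOException("Unexpected status from v1/node: " + status);
            } catch (IOException e) {
                lastFailure = e; // transient failure, try again
            }
        }
        throw new IOException(
                "Could not determine node role after " + maxAttempts + " attempts", lastFailure);
    }
}
```

With this shape, a timeout on the master surfaces as an error to the caller after the retries are exhausted, rather than silently producing an empty node list.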
The exception shows up as a serialization exception: com.fasterxml.jackson.databind.JsonMappingException: No serializer found for class java.util.concurrent.TimeoutException and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: com.google.common.collect.Values[0]->com.facebook.presto.failureDetector.Stats["lastFailureException"])
It was found that the serialization exception was caused by an issue in Presto that made all v1/node calls fail. Nevertheless, we should make this change in PrestoClusterManager to make it more robust, though it can be taken up at lower priority.