PrestoClusterManager wrongly infers master node as worker

Open shubhamtagra opened this issue 7 years ago • 2 comments

PrestoClusterManager assumes that if the v1/node call fails, the node must be a slave. This holds in the ideal case, but in exceptional cases the call can fail on the master node too. For example, we saw a SocketTimeoutException in getNodes, which caused the master node to be inferred as a slave; it then returned an empty node list, causing queries to fail.

We should do at least the following:

  1. Infer node as worker only if v1/node returns 404

  2. Add retries if v1/node fails due to exceptions
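
The two points above can be sketched roughly as follows. This is not the actual PrestoClusterManager code: the class `NodeRoleProbe`, the `Role` enum, and the `probeWithRetries` method are hypothetical names introduced for illustration, and the real HTTP call to v1/node is stood in for by a `Callable` that returns a status code. It also assumes that a 200 response means the node is the master.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names exist in rubix.
final class NodeRoleProbe {
    enum Role { MASTER, WORKER, UNKNOWN }

    // Map the HTTP status of the v1/node call to a role.
    // Assumption: 200 means master; only an explicit 404 means worker.
    static Role classify(int httpStatus) {
        if (httpStatus == 200) {
            return Role.MASTER;
        }
        if (httpStatus == 404) {
            return Role.WORKER;
        }
        return Role.UNKNOWN;
    }

    // Call v1/node up to maxAttempts times. An exception (for example a
    // SocketTimeoutException) or an unexpected status is treated as
    // inconclusive and retried, never as proof that the node is a worker.
    static Role probeWithRetries(Callable<Integer> statusCall, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                Role role = classify(statusCall.call());
                if (role != Role.UNKNOWN) {
                    return role;
                }
            } catch (Exception e) {
                // Inconclusive: fall through to the next attempt.
            }
        }
        // Still undecided; the caller should not assume WORKER here.
        return Role.UNKNOWN;
    }
}
```

With this shape, a call such as `NodeRoleProbe.probeWithRetries(() -> 404, 3)` yields `WORKER`, while repeated timeouts yield `UNKNOWN`, so the caller can fail loudly or keep the last known role instead of returning an empty node list. A real implementation would probably also wait between attempts.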

shubhamtagra avatar Dec 13 '17 09:12 shubhamtagra

The exception shows up as a serialization exception: com.fasterxml.jackson.databind.JsonMappingException: No serializer found for class java.util.concurrent.TimeoutException and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: com.google.common.collect.Values[0]->com.facebook.presto.failureDetector.Stats["lastFailureException"])

shubhamtagra avatar Dec 13 '17 17:12 shubhamtagra

It was found that the serialization exception was caused by an issue in Presto that made all v1/node calls fail. Nevertheless, we should still make this change in PrestoClusterManager to make it more robust, though it can be taken up at a lower priority.

shubhamtagra avatar Dec 13 '17 18:12 shubhamtagra