orchestrator
Report IntermediateMaster errors under CoMaster deployment
I got two failure detections on two co-master clusters: UnreachableIntermediateMasterWithLaggingReplicas and DeadIntermediateMasterAndReplicas. Both were reported while the clusters were running as co-masters.
In this architecture I would expect UnreachableMaster or one of the co-master failures to be reported instead.
I then checked the analysis code: https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/analysis_dao.go#L566-L569 and https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/analysis_dao.go#L590-L597
If LastCheckPartialSuccess is true and syncing between the two co-masters is healthy, these IntermediateMaster failures are reported instead of the co-master ones.
With syncing healthy, we get DeadIntermediateMasterAndReplicas if both co-masters are unreachable, and UnreachableIntermediateMasterWithLaggingReplicas if the primary co-master is unreachable and some replicas are lagging.
LastCheckPartialSuccess is set to true during the discovery SQL:
https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/instance_dao.go#L425-L430
There appears to be a bug in how co-master and intermediate-master failures are analyzed. It might be a fault in the if-else conditions, where a co-master falls through to an intermediate-master branch.
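To illustrate the kind of fall-through I suspect, here is a minimal sketch. It is not the code from analysis_dao.go: the nodeState struct, its fields, the classify function, and the exact branch conditions are all hypothetical and heavily simplified. It only shows how a co-master, which has a master of its own, can end up in an intermediate-master branch when the co-master branch does not match because of LastCheckPartialSuccess.

```go
package main

import "fmt"

// nodeState is a hypothetical, simplified stand-in for the data the
// analysis works from. Field names are illustrative, not orchestrator's.
type nodeState struct {
	HasMaster               bool // instance replicates from another instance
	IsCoMaster              bool // instance and its master replicate from each other
	LastCheckValid          bool // last discovery fully succeeded
	LastCheckPartialSuccess bool // last discovery got partway before failing
	LaggingReplicas         int  // replicas reported as lagging
}

// classify mimics an if-else chain with invented conditions.
// The first matching branch wins, so branch order and guards matter.
func classify(n nodeState) string {
	switch {
	case n.LastCheckValid:
		return "NoProblem"
	case !n.HasMaster:
		return "UnreachableMaster"
	case n.IsCoMaster && !n.LastCheckPartialSuccess:
		// Assumed co-master branch: skipped when the check partially succeeded.
		return "DeadCoMaster"
	case n.LastCheckPartialSuccess && n.LaggingReplicas > 0:
		// Assumed intermediate-master branch: it never checks IsCoMaster,
		// so a co-master that was skipped above lands here.
		return "UnreachableIntermediateMasterWithLaggingReplicas"
	default:
		return "NoProblem"
	}
}

func main() {
	coMaster := nodeState{
		HasMaster:               true,
		IsCoMaster:              true,
		LastCheckPartialSuccess: true,
		LaggingReplicas:         2,
	}
	// A co-master is reported with an intermediate-master failure.
	fmt.Println(classify(coMaster))
}
```

In this toy version, the co-master is misreported only because the intermediate-master branch does not exclude co-masters. Whether the real conditions behave the same way needs to be confirmed against the linked lines.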