
Report IntermediateMaster errors under CoMaster deployment

Open ZhangJiaQiao opened this issue 1 year ago • 0 comments

I got two failure detections on two co-master clusters: UnreachableIntermediateMasterWithLaggingReplicas and DeadIntermediateMasterAndReplicas. Both were reported while the clusters were running as co-masters.


Under such an architecture, I would expect UnreachableMaster or another co-master failure to be reported instead.

I then checked the analysis code:
https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/analysis_dao.go#L566-L569
https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/analysis_dao.go#L590-L597

If LastCheckPartialSuccess is true and replication between the two co-masters is working, these IntermediateMaster failures are reported instead of the co-master ones. With replication working, we get DeadIntermediateMasterAndReplicas if both co-masters are unreachable, and UnreachableIntermediateMasterWithLaggingReplicas if the primary co-master is unreachable and some replicas are lagging.
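To make the suspected fall-through concrete, here is a toy sketch of how an if-else chain could produce this behaviour. It is not the orchestrator code: the struct, the `classify` function and every condition in it are hypothetical simplifications. It assumes that a co-master is not flagged as `IsMaster` (because it replicates from its peer) and that the master and co-master branches only match when `LastCheckPartialSuccess` is false.

```go
package main

import "fmt"

// replicationAnalysis is a hypothetical, trimmed-down stand-in for the fields
// mentioned in this issue. It is not orchestrator's real struct.
type replicationAnalysis struct {
	IsMaster                bool
	IsCoMaster              bool
	LastCheckValid          bool
	LastCheckPartialSuccess bool
	CountReplicas           int
	CountValidReplicas      int
	CountLaggingReplicas    int
}

// classify is a toy decision chain with assumed conditions. The master and
// co-master branches require !LastCheckPartialSuccess, so a co-master whose
// last check partially succeeded falls through to the intermediate-master
// branches, which only test !IsMaster.
func classify(a replicationAnalysis) string {
	switch {
	case a.IsMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess:
		return "UnreachableMaster"
	case a.IsCoMaster && !a.LastCheckValid && !a.LastCheckPartialSuccess:
		return "DeadCoMaster"
	case !a.IsMaster && !a.LastCheckValid && a.CountReplicas > 0 && a.CountValidReplicas == 0:
		return "DeadIntermediateMasterAndReplicas"
	case !a.IsMaster && !a.LastCheckValid && a.CountLaggingReplicas > 0:
		return "UnreachableIntermediateMasterWithLaggingReplicas"
	default:
		return "NoProblem"
	}
}

func main() {
	// Primary co-master unreachable, last check partially succeeded,
	// one of two replicas lagging.
	lagging := replicationAnalysis{
		IsCoMaster:              true,
		LastCheckPartialSuccess: true,
		CountReplicas:           2,
		CountValidReplicas:      1,
		CountLaggingReplicas:    1,
	}
	fmt.Println(classify(lagging)) // UnreachableIntermediateMasterWithLaggingReplicas

	// Same co-master, but no replica is valid any more.
	dead := lagging
	dead.CountValidReplicas = 0
	fmt.Println(classify(dead)) // DeadIntermediateMasterAndReplicas

	// Without partial success, the co-master branch matches as expected.
	full := lagging
	full.LastCheckPartialSuccess = false
	fmt.Println(classify(full)) // DeadCoMaster
}
```

In this toy chain, checking `IsCoMaster` before any `!IsMaster` branch, or excluding co-masters from the intermediate-master branches, would keep the co-master classification. Whether the same applies to the real chain needs to be confirmed against the linked lines.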

LastCheckPartialSuccess is set to true during the discovery SQL process: https://github.com/openark/orchestrator/blob/1a6c3cd6634ce72bb068de81b6af73691e0ce32c/go/inst/instance_dao.go#L425-L430

There seems to be a bug in how co-master and intermediate-master failures are analyzed. It may be a fault in the if-else conditions.

ZhangJiaQiao • Mar 23 '23 07:03