Broken connectivity (context canceled) with istio 1.10

Open iffyio opened this issue 3 years ago • 1 comments

What happened: After we upgraded Istio in our cluster to v1.10.3, connectivity to Open Match components sporadically becomes flaky after OM is redeployed, and the flakiness can last for hours. RPC calls from, e.g., our match function to om-query return context canceled:

from queryService.QueryTickets: rpc error: code = Canceled desc = context canceled

and that has a snowball effect on the rest of the matchmaker.

What you expected to happen: OM should always be reachable after redeployments

How to reproduce it (as minimally and precisely as possible): The issue doesn't happen consistently, but it only ever happens after OM is redeployed on a cluster with Istio v1.10.3 enabled.

Anything else we need to know?: We run Istio's Envoy sidecars alongside every workload in the cluster, so in this instance it is very likely that the Envoy proxy next to the om-query process wasn't handling the request properly.

The issue wasn't present with our previous Istio installations and only appeared after the recent upgrade (we went from 1.7.3 to 1.10.3, so it's unclear in which Istio version this broke). When it occurs, pretty much all requests to om-query fail at first, then some requests gradually begin to get through, and eventually the cluster seems to reach a stable state.

So we suspected om's usage of headless services because that was the only difference between om and other workloads in the cluster and sure enough, switching the services to a regular clusterIP assigned service seems to resolve the issue.
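For anyone who wants to check which kind of service a name resolves to from inside the cluster, here is a minimal Python sketch. It assumes that a headless service resolves to one A record per ready pod, while a ClusterIP service resolves to a single virtual IP. The default hostname `open-match-query.open-match.svc.cluster.local` and port 50503 are assumptions for illustration and may not match your install.

```python
import socket
import sys


def resolve_ipv4(host, port):
    """Return the sorted, de-duplicated IPv4 addresses that DNS reports for host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def describe(addresses):
    """Rough heuristic only: several A records suggest a headless service."""
    if not addresses:
        return "no addresses resolved"
    if len(addresses) == 1:
        return "one address: a ClusterIP service, or a headless service with a single ready pod"
    return "%d addresses: likely a headless service returning pod IPs" % len(addresses)


if __name__ == "__main__":
    # Assumed default name; pass the real service DNS name as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "open-match-query.open-match.svc.cluster.local"
    port = 50503  # assumed gRPC port; it does not affect which addresses DNS returns
    try:
        addrs = resolve_ipv4(host, port)
    except OSError as exc:
        print("lookup failed:", exc)
    else:
        print(host, "->", addrs)
        print(describe(addrs))
```

Running this from a pod before and after switching the service type should show the address list shrinking from several pod IPs to one cluster IP, if the assumptions above hold.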

Now we're looking to run OM without headless services, but looking at #1183, it sounds like an awfully similar root cause, which would suggest that OM's gRPC client-side load balancing implementation and newer versions of Istio's Envoy aren't playing together nicely.

I'm wondering what your thoughts are on this issue and on our switching away from headless services. Would it be possible for you folks to, e.g., make the custom gRPC load balancing optional so that environments that don't need it can avoid potential issues in the future?

Output of kubectl version:

Cloud Provider/Platform (AKS, GKE, Minikube etc.): GKE

Open Match Release Version: openmatch version v1.0.0

Install Method (yaml/helm): yaml

iffyio • Sep 10 '21 13:09

Hi @iffyio. We're slowly getting through the outstanding issues and are wondering whether you have come up with a solution since opening this one. If you did, we would appreciate any findings that help resolve it. We noticed that your Istio upgrade spanned three minor versions, one of which may have introduced the culprit, though there have likely been additional upgrades since. Any updates or findings would help us either resolve or close this issue. If you happened to resolve it, a PR would be appreciated.

syntxerror • Aug 02 '22 14:08

Just wanted to add to this: this has happened to us, and we don't have Istio, Envoy, or anything else other than Open Match. I haven't investigated it thoroughly yet and don't know how to reproduce it. I may try to intentionally cause a failover in the Redis Sentinel installation, and will update if I find something.

ninetyninereds • Aug 15 '22 09:08

Hey @ninetyninereds, may I know whether you have reached any conclusion on this issue or been able to reproduce it? Please let us know if we should keep this issue open.

mridulji • Sep 01 '22 07:09

@mridulji I haven't had time to try to reproduce it yet. If/when I do, I can always open a new issue with a more accurate description of the problem. Right now I have no useful information to share.

ninetyninereds • Sep 01 '22 11:09

Thanks for your reply. Marking this issue as closed.

mridulji • Sep 01 '22 12:09