client-go memCacheClient: cached transient error leads to resource lookup failures

An "http2: client connection force closed via ClientConn.Close" error occurred for some reason (maybe the api-server was under load). This led our deployment tool (a helm-based one, which makes use of memCacheClient) to print some errors and fail subsequent resource lookup requests by group-version.

My debugging showed that this can happen if the errored responses are cached and not renewed during subsequent lookups. Possibly this is a failure of the isTransientError helper, which does not take the http2: ... ClientConn.Close error into account.

Logs:

...
E0603 13:19:34.358760   15514 memcache.go:196] couldn't get resource list for networking.k8s.io/v1beta1: Get "https://domain/apis/networking.k8s.io/v1beta1?timeout=32s": http2: client connection force closed via ClientConn.Close
E0603 13:19:34.358768   15514 memcache.go:196] couldn't get resource list for scheduling.k8s.io/v1: Get "https://domain/apis/scheduling.k8s.io/v1?timeout=32s": http2: client connection force closed via ClientConn.Close
E0603 13:19:34.358783   15514 memcache.go:196] couldn't get resource list for deckhouse.io/v1alpha2: Get "https://domain/apis/deckhouse.io/v1alpha2?timeout=32s": http2: client connection force closed via ClientConn.Close
E0603 13:19:34.358849   15514 memcache.go:196] couldn't get resource list for coordination.k8s.io/v1beta1: Get "https://apidomain/apis/coordination.k8s.io/v1beta1?timeout=32s": http2: client connection force closed via ClientConn.Close
Error: helm templates rendering failed: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "Deployment" in version "apps/v1"

Screenshot from 2022-06-08 14-55-12

Jun 08 '22 12:06 distorhead

Some more debugging

There is strange behaviour in our helm-based tool (which uses cli-runtime and, in turn, client-go): the internal error was shadowed and resulted in a related resource lookup error.

Debugging reveals that this error shadowing is explained by this piece of code (https://github.com/kubernetes/client-go/blob/master/restmapper/discovery.go#L151):

func GetAPIGroupResources(cl discovery.DiscoveryInterface) ([]*APIGroupResources, error) {
	gs, rs, err := cl.ServerGroupsAndResources()
	if rs == nil || gs == nil {
		return nil, err
		// TODO track the errors and update callers to handle partial errors.
	}
	// ... (rest of the function omitted)
}

Here partial errors are ignored. Errors are also ignored when rs or gs are not nil but contain zero elements.

Jun 08 '22 14:06 distorhead

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Sep 06 '22 14:09 k8s-triage-robot

/lifecycle rotten

Oct 06 '22 15:10 k8s-triage-robot

/close not-planned

Nov 05 '22 15:11 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Nov 05 '22 15:11 k8s-ci-robot
