OOM killed

Open qicz opened this issue 1 year ago • 17 comments

Description:

EG watches some HTTPRoutes that have errors (maybe the Service does not exist). EG has been OOM-killed due to reconciling them.

Logs: image

qicz avatar Dec 05 '23 03:12 qicz

IMO, we should set RequeueAfter to requeue.

qicz avatar Dec 05 '23 03:12 qicz

I could not reproduce it. Can you provide the steps to reproduce it? @qicz

Xunzhuo avatar Dec 06 '23 04:12 Xunzhuo

It could be that we are missing a valid error return on getting those resources

cnvergence avatar Dec 06 '23 15:12 cnvergence

> I could not reproduce it. Can you provide the steps to reproduce it? @qicz

One HTTPRoute with a Service that does not exist.

qicz avatar Dec 07 '23 06:12 qicz

I tried that, and the HTTPRoute just reported BackendNotFound; EG still works well.

Xunzhuo avatar Dec 07 '23 07:12 Xunzhuo

> I tried that, and the HTTPRoute just reported BackendNotFound; EG still works well.

This is reported too often, and there are more invalid HTTPRoutes; EG has been killed due to reconciling them.

qicz avatar Dec 13 '23 06:12 qicz

@qicz I'm facing the same error here, but my setup is about 1300 HTTPRoute CRs and about 20 Gateways with mergeGateway=true. image

Maybe it's not the non-existent backends that cause the EG OOM, but the number of Gateway API CRs. I'm seeing the envoy-gateway Deployment's pod eat too much memory. image The default EG memory limit is 1g; you can change this to unlimited, but the underlying problem remains.

zzjin avatar Dec 18 '23 11:12 zzjin

@qicz for your logs, can you please paste the entire log showing the namespace and name of the Service, along with kubectl info on the Service as well as the HTTPRoute that links to it?

arkodg avatar Dec 18 '23 19:12 arkodg

> @qicz for your logs, can you please paste the entire log showing the namespace and name of the Service, along with kubectl info on the Service as well as the HTTPRoute that links to it?

@arkodg sorry for the slow reply. The namespace and Service are from my company's app, so I have already cleared them. Sorry about this.

qicz avatar Jan 10 '24 08:01 qicz

> @qicz I'm facing the same error here, but my setup is about 1300 HTTPRoute CRs and about 20 Gateways with mergeGateway=true. image
>
> Maybe it's not the non-existent backends that cause the EG OOM, but the number of Gateway API CRs. I'm seeing the envoy-gateway Deployment's pod eat too much memory. image The default EG memory limit is 1g; you can change this to unlimited, but the underlying problem remains.

In my case, there are only ~30 HTTPRoutes. But I cannot set the memory to unlimited; that is bad for the Kubernetes cluster.

qicz avatar Jan 10 '24 09:01 qicz

> In my case, there are only ~30 HTTPRoutes. But I cannot set the memory to unlimited; that is bad for the Kubernetes cluster.

It doesn't need to be unlimited; something larger to accommodate the routes is enough. Still, something must be wrong for it to OOM here.

zzjin avatar Jan 10 '24 09:01 zzjin

@qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.

arkodg avatar Jan 10 '24 18:01 arkodg

> @qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.

Our analysis concludes that the OOM is caused by there being many Secrets and the memory limit not being set properly.

qicz avatar Jan 17 '24 07:01 qicz

Suggestion: use protobuf when connecting to Kubernetes to reduce memory usage. xref #1596

qicz avatar Jan 17 '24 07:01 qicz

> @qicz @zzjin can you outline steps to reproduce the problem? From this chat it's hard to understand what the trigger is.
>
> Our analysis concludes that the OOM is caused by there being many Secrets and the memory limit not being set properly.

Maybe that's the problem: in our cluster we have about 3000 Ingresses with HTTPS, which means about 3000 Secrets.

zzjin avatar Jan 17 '24 08:01 zzjin

@qicz can you share mem stats of EG before & after https://github.com/envoyproxy/gateway/pull/1596 ?

arkodg avatar Jan 17 '24 18:01 arkodg

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] avatar Feb 16 '24 20:02 github-actions[bot]

closing due to no response, please reopen if you hit this issue again

arkodg avatar May 22 '24 23:05 arkodg

Hi @arkodg, I've hit the same issue.

It seems like Envoy Gateway is creating infinite HTTPRoutes for the HTTP01 challenge while the challenge is not satisfied. My (unproven) theory is that it is provisioning the HTTPRoute resource with generateName instead of using a predictable name, and this causes an infinite reconciliation loop.

EDIT: by looking at the HTTPRoute owner references, this now looks like a cert-manager issue

miguelvr avatar Jul 15 '24 13:07 miguelvr

thanks for debugging this one @miguelvr , cross linking the cert-manager issue here https://github.com/cert-manager/cert-manager/issues/7176

@envoyproxy/gateway-maintainers should we consider something like envoy's overload manager where we stop reconciling more resources (flag this in a GatewayClass status) in case we hit some specified memory threshold ?
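A rough sketch of the kind of check this would involve, using only the Go standard library. The names `overLimit` and `heapInUse`, the 0.9 fraction and the use of `HeapInuse` as the memory signal are all assumptions for illustration; this is not how Envoy's overload manager or EG is implemented, and the GatewayClass status update is only represented by a log line:

```go
package main

import (
	"fmt"
	"runtime"
)

// overLimit (hypothetical helper) reports whether used bytes have reached
// the given fraction of limit. A limit of 0 means no limit is configured.
func overLimit(used, limit uint64, fraction float64) bool {
	if limit == 0 {
		return false
	}
	return float64(used) >= float64(limit)*fraction
}

// heapInUse returns the bytes currently held in in-use heap spans,
// as reported by the Go runtime.
func heapInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	// 1 GiB, standing in for the 1g default limit mentioned earlier in the thread.
	const limit uint64 = 1 << 30

	if overLimit(heapInUse(), limit, 0.9) {
		// A real controller would skip reconciling further resources here
		// and surface the condition in a status field.
		fmt.Println("memory threshold hit: pausing reconciliation")
		return
	}
	fmt.Println("below threshold: reconciling as usual")
}
```

Note that `runtime.ReadMemStats` briefly stops the world, so a real implementation would likely sample it periodically rather than on every reconcile.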

arkodg avatar Jul 15 '24 16:07 arkodg