consul icon indicating copy to clipboard operation
consul copied to clipboard

proxycfg: check Ingress Gateway watches during discovery chain

Open adrien-f opened this issue 3 years ago • 3 comments
trafficstars

Description

Greetings, I'm not sure if you still accept fixes for the v1.10 branch but I thought you might find this fix useful.

We've been suffering from a memory leak on our Consul Clients hosting the Ingress Gateway service and only them. After investigation, we found out that the number of goroutines was ever-growing:

image

We went to take some profiling dumps and found out a lot of goroutines were parked:

image

And the heap was taken by watchDiscoveryChain and notifyBlockingQuery:

image

This leads us to look into agent/proxycfg/state.go:

https://github.com/hashicorp/consul/blob/de95b6f546f7f1b891cfd596dd4e3293a3a3abcd/agent/proxycfg/state.go#L1819-L1851

I immediately see at the beginning a key existence check for snap.ConnectProxy.WatchedDiscoveryChains, and then, when creating the watch request, an assignment in either snap.IngressGateway.WatchedDiscoveryChains or snap.ConnectProxy.WatchedDiscoveryChains.

This made us add the extra check for existence in snap.IngressGateway.WatchedDiscoveryChains. Since then, we've observed a pretty stable goroutines amount and no more OOMKill:

image

We are not in the right configuration to upgrade yet from v1.10 which is why we hope this could still be useful for the community if anyone else is also experiencing this issue.

Testing & Reproduction steps

  • In the case of bugs, describe how to replicate
  • If any manual tests were done, document the steps and the conditions to replicate
  • Call out any important/ relevant unit tests, e2e tests or integration tests you have added or are adding

Links

Include any links here that might be helpful for people reviewing your PR (Tickets, GH issues, API docs, external benchmarks, tools docs, etc). If there are none, feel free to delete this section.

Please be mindful not to leak any customer or confidential information. HashiCorp employees may want to use our internal URL shortener to obfuscate links.

PR Checklist

  • [ ] updated test coverage
  • [ ] external facing docs updated
  • [x] not a security concern

adrien-f avatar Jun 30 '22 15:06 adrien-f

CLA assistant check
All committers have signed the CLA.

hashicorp-cla avatar Jun 30 '22 15:06 hashicorp-cla

Hey @adrien-f

Thank you for this PR and the detailed explanation ( that always helps! ). I'll have to get an answer back to you if we support changes for this branch. I'm pretty sure we do, but i'll have to confirm. While I figure that out for you could you sign the CLA agreement so that our tests can run and we could add this to the review queue 😄 ?

Amier3 avatar Jul 05 '22 15:07 Amier3

👋 Hey @Amier3, sorry it took some time to get the CLA approval with the summer holidays!

adrien-f avatar Aug 04 '22 12:08 adrien-f

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

github-actions[bot] avatar Oct 04 '22 01:10 github-actions[bot]

Closing due to inactivity. If you feel this was a mistake or you wish to re-open at any time in the future, please leave a comment and it will be re-surfaced for the maintainers to review.

github-actions[bot] avatar Nov 19 '22 01:11 github-actions[bot]