
Possible memory leak in NodeJS / Python services

Open askmeegs opened this issue 3 years ago • 8 comments

Uptime checks for the production deployment of OnlineBoutique have been failing once every few weeks. Looking at kubectl events timed with an uptime check failure --

38m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-azeq   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "004baa97-3c3b554d-9bbcccf8-870ced36"}}
43m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-i6m8   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "706b7d5f-9df4b412-e8eb875e-179c4765"}}
46m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-jvwz   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "a0f734c5-5c9a56e1-06aeb420-0010498e"}}
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 569290 (node) score 2181 or sacrifice child
Killed process 569290 (node) total-vm:1418236kB, anon-rss:121284kB, file-rss:33236kB, shmem-rss:0kB
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 2592522 (grpc_health_pro) score 1029 or sacrifice child
Killed process 2592530 (grpc_health_pro) total-vm:710956kB, anon-rss:1348kB, file-rss:7376kB, shmem-rss:0kB
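For reference, the kernel's kB figures in these OOM events can be converted to MB with a small helper (a sketch, not code from the repo) to see what the killed process actually had resident versus reserved:

```javascript
// Sketch: convert the kB figures from a kernel OOM-kill message into MB,
// to compare the process's reserved (total-vm) vs resident (rss) memory.
function parseOomLine(line) {
  const out = {};
  for (const [, key, kb] of line.matchAll(/(total-vm|anon-rss|file-rss|shmem-rss):(\d+)kB/g)) {
    out[key] = Math.round(Number(kb) / 1024); // kB -> MB
  }
  return out;
}
```

Applied to the `node` event above, this shows roughly 1385 MB of virtual memory reserved but only about 118 MB of anonymous RSS resident when the cgroup limit was hit.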

It looks like the pods' memory usage is exceeding their limits. There seems to be plenty of allocatable memory across the prod GKE nodes:

Screen Shot 2021-05-03 at 1 55 56 PM

But as observed by @bourgeoisor, it seems that three of the workloads are using steadily increasing amounts of memory until the pods are killed by GKE.

Currency and payment (NodeJS):

Screen Shot 2021-05-03 at 2 23 13 PM

Recommendation (Python):

Screen Shot 2021-05-03 at 2 24 16 PM

TODO - investigate possible memory leaks, starting with the NodeJS services. Investigate why these services use an increasing amount of memory over time rather than a constant amount. Then investigate the Python services and see whether other Python services (emailservice, for instance) show the same behavior as the recommendation service.
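One low-cost way to confirm the growth pattern in the NodeJS services, sketched here and not taken from the repo, is to log `process.memoryUsage()` on an interval and watch whether the numbers plateau after warm-up or keep climbing:

```javascript
// Minimal sketch (not from the repo): periodically sample process memory so
// a slow leak shows up as steadily climbing numbers in the service logs.
function memorySampleMb() {
  const toMb = (bytes) => Math.round(bytes / (1024 * 1024));
  const { rss, heapUsed, external } = process.memoryUsage();
  return { rssMb: toMb(rss), heapUsedMb: toMb(heapUsed), externalMb: toMb(external) };
}

// Log a sample every minute; unref() keeps the timer from holding the
// process open on its own.
function startMemoryWatch(intervalMs = 60_000) {
  return setInterval(() => console.log(JSON.stringify(memorySampleMb())), intervalMs).unref();
}
```

In a leaking service, `rssMb` climbs monotonically toward the container limit; a healthy service levels off. Watching `heapUsed` and `external` separately also hints at whether the leak is in JS objects or in native buffers.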

askmeegs avatar May 03 '21 18:05 askmeegs

I was unable to find the root cause of the NodeJS memory leak after a few weeks of testing. This needs a Node expert, or someone else, to investigate further. Internal doc with my notes so far: https://docs.google.com/document/d/1gyc8YvfKwMr86wzY_cz1NICQU48VE-wXifqjDprAafI/edit?resourcekey=0-g04_Kba4MQjeXDFzsp-Bqw

askmeegs avatar Sep 13 '21 16:09 askmeegs

According to the profiler data for the currencyservice and paymentservice, the retry-request package seems to be the one using a lot of memory. It is imported by the google-cloud/common library, which is used by google-cloud/trace, google-cloud/debug, and google-cloud/profiler.

The same behaviour is reported in the google-cloud/debug Node.js repository. As per this recent comment, the issue seems to have been eradicated after disabling google-cloud/debug.

I have created four PRs to stage 4 clusters with different settings, to observe how memory usage evolves over time:

  • #637 - has no google-cloud/debug
  • #638 - has no google-cloud/trace
  • #639 - has no google-cloud/profiler
  • #640 - has all three of the above disabled

image
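The four configurations above amount to a kill switch around each agent's `start()` call. A hypothetical sketch of that gating logic (the env var names here are assumptions for illustration, not necessarily what the PRs used):

```javascript
// Hypothetical sketch of per-agent kill switches, one per staged
// configuration. Env var names are assumptions, not the PRs' actual ones.
function startCloudAgents(env = process.env) {
  const started = [];
  const disabled = (name) => env[`DISABLE_${name}`] === '1';

  if (!disabled('DEBUGGER')) {
    require('@google-cloud/debug-agent').start({ allowExpressions: true });
    started.push('debug');
  }
  if (!disabled('TRACE')) {
    require('@google-cloud/trace-agent').start();
    started.push('trace');
  }
  if (!disabled('PROFILER')) {
    require('@google-cloud/profiler').start();
    started.push('profiler');
  }
  return started; // [] when all three are off, as in the all-disabled case
}
```

Because each `require` happens lazily inside its guard, a disabled agent is never even loaded, which is what lets the staged clusters isolate which agent drives the memory growth.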

Shabirmean avatar Nov 26 '21 20:11 Shabirmean

So the issue clearly seems to be with any library that uses google-cloud/common; in our case, google-cloud/debug and google-cloud/trace. See the memory graphs for the four cases described in the PRs above. Ideally, we would have to wait for the fix for https://github.com/googleapis/cloud-debug-nodejs/issues/811

Screen Shot 2021-11-27 at 7 19 33 AM

Shabirmean avatar Nov 29 '21 15:11 Shabirmean

One more thing we noticed is that google-cloud/debug was erroring out with a number of "insufficient scopes" errors:

image

This is because the Cloud Debugger API access scope is not granted to the online-boutique-pr and online-boutique-master cluster nodes. Thus, we should create the clusters with --scopes=https://www.googleapis.com/auth/cloud_debugger,gke-default so that the debug agent can connect to the API.
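A quick way to verify a node's scopes is the GCE metadata server, which returns the OAuth scope list one scope per line at `http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes` (with the `Metadata-Flavor: Google` header). A small helper, sketched here and not part of the repo, can check that list for the debugger scope:

```javascript
// Sketch: check a node's OAuth scope list (as returned by the GCE metadata
// server, one scope per line) for the Cloud Debugger scope.
const DEBUGGER_SCOPE = 'https://www.googleapis.com/auth/cloud_debugger';

function hasDebuggerScope(scopesText) {
  return scopesText
    .split('\n')
    .map((s) => s.trim())
    .includes(DEBUGGER_SCOPE);
}
```

On the original online-boutique-pr nodes this check would come back false, which matches the insufficient scopes errors above.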

I have created a new cluster, online-boutique-pr-v2, with the above-mentioned scopes and updated the GitHub CI workflows to use it. The changes can be viewed in #644.

This takes care of all the insufficient scopes errors that were observed, but it does not fully eradicate the memory issue. The change seems to delay the time it takes for memory to hit its peak by ~1.5 hours.

image

Shabirmean avatar Nov 29 '21 15:11 Shabirmean

I created two PRs to generate some profiler data in the CI project for this repo:

  • #653 has the google-cloud-debug agent enabled
  • #654 has no google-cloud-debug agent enabled

These PRs use different version tags for the profiler agent in the currencyservice:

  • #653 --> vNew-HasDebug
  • #654 --> vNew-NoDebug

You can view the profiler data for these versions under the Profiler view in the CI project. Filter by the following criteria to understand the differences:

image

Shabirmean avatar Dec 14 '21 15:12 Shabirmean

Hi @Shabirmean,

Please correct me if I'm wrong. We are now just waiting on this issue to be fixed via https://github.com/googleapis/cloud-debug-nodejs/issues/811. Judging from Ben Coe's comment, this is something they plan to fix.

Let me know if there is any action we need to take in the meantime.

NimJay avatar Jan 12 '22 20:01 NimJay

> Hi @Shabirmean,
>
> Please correct me if I'm wrong. We are now just waiting on this issue to be fixed via googleapis/cloud-debug-nodejs#811. Judging from Ben Coe's comment, this is something they plan to fix.
>
> Let me know if there is any action we need to take in the meantime.

Hello @NimJay

There isn't much we can do from our side. I have communicated with Ben and am seeing if we can work with the debug team to get that issue (https://github.com/googleapis/cloud-debug-nodejs/issues/811) fixed. Until then, no action is needed or possible from our side. I suggest we keep this issue open!

Shabirmean avatar Jan 12 '22 21:01 Shabirmean

This is still an issue, but bumping the priority down to P3.

bourgeoisor avatar May 06 '22 19:05 bourgeoisor

Now that https://github.com/GoogleCloudPlatform/microservices-demo/pull/1281 is merged into the main branch, we can close this issue. Cloud Debugger has been removed from this project.

mathieu-benoit avatar Nov 29 '22 13:11 mathieu-benoit