Possible memory leak in NodeJS / Python services
Uptime checks for the production deployment of OnlineBoutique have been failing once every few weeks. Looking at the `kubectl` events timed with an uptime check failure:
```
38m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-azeq {"unmanaged": {"net.ipv4.tcp_fastopen_key": "004baa97-3c3b554d-9bbcccf8-870ced36"}}
43m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-i6m8 {"unmanaged": {"net.ipv4.tcp_fastopen_key": "706b7d5f-9df4b412-e8eb875e-179c4765"}}
46m Warning NodeSysctlChange node/gke-online-boutique-mast-default-pool-65a22575-jvwz {"unmanaged": {"net.ipv4.tcp_fastopen_key": "a0f734c5-5c9a56e1-06aeb420-0010498e"}}
39m Warning OOMKilling node/gke-online-boutique-mast-default-pool-65a22575-jvwz Memory cgroup out of memory: Kill process 569290 (node) score 2181 or sacrifice child
Killed process 569290 (node) total-vm:1418236kB, anon-rss:121284kB, file-rss:33236kB, shmem-rss:0kB
39m Warning OOMKilling node/gke-online-boutique-mast-default-pool-65a22575-jvwz Memory cgroup out of memory: Kill process 2592522 (grpc_health_pro) score 1029 or sacrifice child
Killed process 2592530 (grpc_health_pro) total-vm:710956kB, anon-rss:1348kB, file-rss:7376kB, shmem-rss:0kB
```
It looks like the pods' memory usage is exceeding their configured memory limits; there seems to be plenty of allocatable memory across the prod GKE nodes. And as observed by @bourgeoisor, three of the workloads are using steadily increasing amounts of memory until the pods are OOM-killed by GKE.
Currency and payment (NodeJS):
![Screen Shot 2021-05-03 at 2 23 13 PM](https://user-images.githubusercontent.com/3137106/116916345-32b8d300-ac1b-11eb-9643-c007f254f42b.png)
Recommendation (Python):
![Screen Shot 2021-05-03 at 2 24 16 PM](https://user-images.githubusercontent.com/3137106/116916384-4106ef00-ac1b-11eb-9a3b-0bd7998fcbd5.png)
TODO - investigate possible memory leaks, starting with the NodeJS services. Investigate why the services use an increasing amount of memory over time rather than a constant amount. Then investigate the Python services and see whether other Python services (emailservice, for instance) show the same behavior as recommendationservice.
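As a starting point for that investigation, one simple way to confirm the steady growth from inside a Node service is to log heap statistics on an interval; a minimal sketch (the interval and log format here are arbitrary):

```js
// Log heap statistics once a minute so a slow leak shows up as a
// steadily rising heapUsed/rss in the pod logs (values in MiB).
const MiB = 1024 * 1024;

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    `[mem] rss=${(rss / MiB).toFixed(1)} ` +
      `heapTotal=${(heapTotal / MiB).toFixed(1)} ` +
      `heapUsed=${(heapUsed / MiB).toFixed(1)} ` +
      `external=${(external / MiB).toFixed(1)}`
  );
}, 60 * 1000).unref(); // unref() keeps the timer from holding the process open
```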
Was unable to find the root cause of the NodeJS memory leak after a few weeks of testing. Needs a Node expert or someone else to further investigate. Internal doc with my notes so far: https://docs.google.com/document/d/1gyc8YvfKwMr86wzY_cz1NICQU48VE-wXifqjDprAafI/edit?resourcekey=0-g04_Kba4MQjeXDFzsp-Bqw
According to the profiler data for the currencyservice and paymentservice, the `retry-request` package is the one that seems to be using a lot of memory. It is imported by the `google-cloud/common` library, which is used by `google-cloud/tracing`, `google-cloud/debug`, and `google-cloud/profiler`.
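For context, all three agents are typically started at the very top of each Node service's entrypoint, so enabling any one of them loads `google-cloud/common` (and, transitively, `retry-request`). A minimal sketch of that initialization, assuming the published npm package names `@google-cloud/profiler`, `@google-cloud/trace-agent`, and `@google-cloud/debug-agent`:

```js
// server.js (top of the service entrypoint): each start() call below
// pulls in @google-cloud/common, and through it retry-request.
require('@google-cloud/profiler').start({
  serviceContext: { service: 'currencyservice', version: '1.0.0' },
});
require('@google-cloud/trace-agent').start();
require('@google-cloud/debug-agent').start({
  serviceContext: { service: 'currencyservice', version: '1.0.0' },
});
```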
The same behaviour is reported in the `google-cloud/debug` nodejs repository. As per this recent comment, the issue seems to have been eradicated after disabling `google-cloud/debug`.
I have created four PRs to stage 4 clusters with different settings to observe how the memory usage trends over time (a sketch of the kind of per-agent toggle involved is shown after the list):
- #637 - has no `google-cloud/debug`
- #638 - has no `google-cloud/trace`
- #639 - has no `google-cloud/profiler`
- #640 - has all three of the above disabled
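Each of these PRs effectively gates one agent behind a toggle at service startup. A minimal sketch of that kind of toggle, using illustrative `DISABLE_*` environment variable names (not necessarily the ones used in the PRs):

```js
// Start each agent only when it is not explicitly disabled, so each of
// the four staging clusters can run a different combination of agents.
if (!process.env.DISABLE_PROFILER) {
  require('@google-cloud/profiler').start({
    serviceContext: { service: 'currencyservice' },
  });
}
if (!process.env.DISABLE_TRACING) {
  require('@google-cloud/trace-agent').start();
}
if (!process.env.DISABLE_DEBUGGER) {
  require('@google-cloud/debug-agent').start();
}
```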
So the issue clearly seems to be with any library that uses `google-cloud/common`; in our case, `google-cloud/debug` and `google-cloud/tracing`. See the memory graphs for the four cases described in the PRs above. Ideally, then, we would have to wait for the fix for https://github.com/googleapis/cloud-debug-nodejs/issues/811.
One more thing that was noticed is that `google-cloud/debug` was erroring out with a bunch of `insufficient scopes` errors.
This is because the `Cloud Debugger API` access scope is not granted to the `online-boutique-pr` and `online-boutique-master` cluster nodes. Thus, we should create the clusters with `--scopes=https://www.googleapis.com/auth/cloud_debugger,gke-default` in order for the debug agent to be able to connect to the API.
I have created a new cluster `online-boutique-pr-v2` with the above-mentioned scopes and updated the GitHub CI workflows to use the new cluster. The changes can be viewed in #644.
This takes care of all the `insufficient scopes` errors that were observed, but it does not fully eradicate the memory issue; it only seems to delay the time it takes for memory usage to hit its peak by ~1.5 hours.
I created two PRs to generate some profiler data in the CI project for this repo.
- #653 has the `google-cloud-debug` agent enabled
- #654 has the `google-cloud-debug` agent disabled
These PRs had different version tags for the profiler agent in the currencyservice:
- #653 --> `vNew-HasDebug`
- #654 --> `vNew-NoDebug`
You can view the profiler data for these versions under the Profiler view in the CI project; filter by these version tags to understand the differences between the two runs.
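For reference, the version string that the Profiler UI filters on comes from the agent's `serviceContext.version` setting, so each PR only needed to change that one field at startup. A sketch, using the tags above:

```js
// currencyservice startup: tag the uploaded profiles so the two
// deployments can be told apart in the Cloud Profiler "Version" filter.
require('@google-cloud/profiler').start({
  serviceContext: {
    service: 'currencyservice',
    version: 'vNew-HasDebug', // 'vNew-NoDebug' in the #654 deployment
  },
});
```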
Hi @Shabirmean,
Please correct me if I'm wrong. We are now just waiting on this issue to be fixed via https://github.com/googleapis/cloud-debug-nodejs/issues/811. Judging from Ben Coe's comment, this is something they plan to fix.
Let me know if there is any action we need to take in the meantime.
Hello @NimJay,
There isn't much we can do from our side. I have communicated with Ben and am seeing if we can work with the debug team to get that issue (https://github.com/googleapis/cloud-debug-nodejs/issues/811) fixed. Until then, no action is needed/possible from our side. I suggest we keep this issue open!
This is still an issue, but bumping priority down to P3.
Now that https://github.com/GoogleCloudPlatform/microservices-demo/pull/1281 is merged into the `main` branch, we can close this issue. Cloud Debugger is now removed from this project.