Frontend Service - goroutine (CPU & Memory) Leak
Expected Behavior
There should be no memory leak resulting from objects not being properly garbage collected.
Actual Behavior
The number of objects on the heap keeps growing. This appears to cause a slow increase in CPU & memory usage, eventually resulting in an outage.
Steps to Reproduce the Problem
- Start the server:
temporal server start-dev --port 7233 --ui-port 8233 --metrics-port 9233
- Do not start workflows, make any gRPC calls, use the SDKs, or open the Web UI; i.e. keep usage to a minimum.
- Periodically check pprof & metrics (see the sketch after this list). This shows that the goroutine count, the number of objects on the heap (i.e. memory allocations for objects), and memory allocations keep growing perpetually.
- runtime.MemStats from http://localhost:<port>/debug/pprof/heap?debug=1
# runtime.MemStats
# Alloc = 65091800
# TotalAlloc = 318383929504
# Sys = 119521560
# Lookups = 0
# Mallocs = 1641446565
# Frees = 1641167975
# HeapAlloc = 65091800
# HeapSys = 82378752
# HeapIdle = 11141120
# HeapInuse = 71237632
# HeapReleased = 4464640
# HeapObjects = 278590
# Stack = 18284544 / 18284544
# MSpan = 825408 / 960048
# MCache = 19200 / 31200
# BuckHashSys = 2638467
# GCSys = 12561120
# OtherSys = 2667429
# NextGC = 69589984
# LastGC = 1721688648441657922
...
# NumGC = 10489
# NumForcedGC = 0
# GCCPUFraction = 4.835076125703521e-05
# DebugGC = false
# MaxRSS = 178909184
- Flame graph
- This was observed across local, shared, and production environments. Please see the Prometheus chart from a production environment where the num_goroutine count kept increasing until a restart. Notice that the leak appears to be isolated to the frontend service; the other services look fine. The CPU usage & memory usage charts followed the same pattern.
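As a sketch of the periodic check mentioned in the steps above (illustrative only: the pprof base URL and interval below are placeholders, not the server's actual configuration), a small Go poller can scrape the summary line of the goroutine profile so the trend is easy to see over days:

```go
// Prints the goroutine profile's summary line (e.g. "goroutine profile: total 4204")
// with a timestamp, every few minutes.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"time"
)

func goroutineSummary(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/debug/pprof/goroutine?debug=1")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	if sc.Scan() {
		return sc.Text(), nil
	}
	return "", sc.Err()
}

func main() {
	const baseURL = "http://localhost:6060" // placeholder pprof address
	for {
		line, err := goroutineSummary(baseURL)
		if err != nil {
			fmt.Println(time.Now().Format(time.RFC3339), "error:", err)
		} else {
			fmt.Println(time.Now().Format(time.RFC3339), line)
		}
		time.Sleep(5 * time.Minute)
	}
}
```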
Specifications
This was observed across multiple versions
- Server Version: 1.22.5, 1.22.7, 1.23.1, etc.
- Platform: Linux
- MTLS enabled
- Auth disabled
Links
This issue is potentially related to https://community.temporal.io/t/high-cpu-usage-memory-leakage-on-frontend-service/4246/1
I am also experiencing this issue.
ditto
Hello, I took a brief look into this and was not able to reproduce it locally. I left the server running for over two days and didn't observe the goroutine count growing. Also, in the screenshot you provided (from http://localhost/debug/pprof/heap?debug=1) I don't see anything abnormal. Even an idle server has periodic internal activity and uses about 2800 goroutines. The heap and alloc counts you see are total allocations (count and size), including memory that may already have been freed; they don't represent the currently "in-use" objects and their size.
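To make that concrete with the numbers above: HeapObjects (278590) is exactly Mallocs minus Frees (1641446565 - 1641167975), i.e. the live object count, while TotalAlloc (~318 GB here) is cumulative over the process lifetime and only ever grows. A minimal, illustrative Go sketch of watching the live count in-process:

```go
// Illustrative only: the live heap object count is Mallocs - Frees (also
// reported as HeapObjects), while TotalAlloc is a cumulative counter that
// grows forever and is not by itself evidence of a leak.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("live objects=%d  HeapAlloc=%d bytes  TotalAlloc=%d bytes\n",
			m.Mallocs-m.Frees, m.HeapAlloc, m.TotalAlloc)
		time.Sleep(time.Minute)
	}
}
```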
However, the last graph you linked showing num_goroutines from a production environment is a bit concerning. Since num_goroutines is reported as a gauge, I initially thought applying a sum operation on it would produce such graphs. But seeing this happen only in the frontend and not in other services is confusing, so we need to dig a little deeper. As a first step, could you please apply the avg or avg_over_time function to num_goroutines and plot the same graph? If you still see the same behavior, then we need to connect to one of the servers and obtain the goroutine profile to see what is holding up all those goroutines. You can do this by port-forwarding to one of the frontend hosts, visiting http://localhost:your_port/debug/pprof, and clicking the "goroutines" link. Please let us know what you find.
@gow thank you for the response.
Please note that this issue still persists in production & (fortunately) in our sandbox environment as well.
I'm about to enable pprof and leave it running for a couple of weeks. But first, here is what I captured:
I will post back with the same two charts and the results of the "goroutines" link in a couple of weeks.
Thanks for posting the avg(num_goroutine) graphs @guritinvda
To enable pprof, do you have to restart the service? If not, could you please connect to one of the frontend hosts in your sandbox environment, visit /debug/pprof, and get the goroutine dump in its current state?
To enable pprof, do you have to restart the service?
Unfortunately, yes.
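(For context, this is a general Go constraint rather than anything Temporal-specific: the net/http/pprof handlers and their listener are wired up when the process starts, which is why turning pprof on requires a restart. A generic sketch with a placeholder port, not Temporal's actual wiring:)

```go
// Generic sketch of how pprof HTTP endpoints are typically enabled in a Go
// process. The handlers are registered and the listener started at boot, so a
// process started without this cannot gain pprof endpoints without a restart.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil)) // placeholder port
	}()
	select {} // stand-in for the real server's main loop
}
```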
@guritinvda any new findings?
I was able to repro the issue in our sandbox environment. Blue arrow in the screenshot is when I enabled pprof and "restarted" the frontend service.
Here's the goroutine dump (2 debug levels): goroutine-1.txt goroutine-2.txt
Hey, which Temporal server version and which SDK versions are you using?
We observed this issue across multiple Temporal server versions: 1.22.5, 1.22.7, 1.23.1, etc. Go SDK version: v1.24.0 (the majority of the workers and clients are here). Python SDK version: 1.5.1.
@uritig thanks for posting the data. Looking at the hung goroutines, the leak seems to be coming from improper grpc.ClientConn cleanup.
goroutine profile: total 4204
3444 @ 0x43e5ce 0x44ea05 0x971c75 0x471661
# 0x971c74 google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run+0x114
/home/runner/go/pkg/mod/google.golang.org/grpc@<version>/internal/grpcsync/callback_serializer.go:76
I found a similar issue on the grpc-go repo that identifies not closing grpc.ClientConn as the cause: https://github.com/grpc/grpc-go/issues/6413#issuecomment-1610047629
There is a PR (https://github.com/temporalio/temporal/pull/6441) in Temporal to cache these connections instead of creating them frequently on demand. That change might fix this issue.
I'll follow up once that PR lands.
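To illustrate the suspected pattern (a hedged sketch, not Temporal's actual code): every grpc.Dial starts background goroutines, including the grpcsync callback serializer seen in the profile above, and they are only torn down by Close(). Creating a ClientConn per call without closing it accumulates goroutines; closing it, or dialing once and reusing the connection, avoids that.

```go
// Illustrative sketch only (not Temporal's actual code). Each grpc.Dial starts
// background goroutines, including the grpcsync.(*CallbackSerializer).run
// goroutines seen in the profile above; they are only torn down when the
// returned *grpc.ClientConn is closed.
package example

import (
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// leaky creates a new ClientConn per call and never closes it, so the
// connection's goroutines accumulate over time.
func leaky(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
}

// oneShot closes the connection when done, releasing its goroutines.
func oneShot(target string) error {
	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()
	// ... issue RPCs on conn ...
	return nil
}

var (
	dialOnce sync.Once
	shared   *grpc.ClientConn
	dialErr  error
)

// sharedConn dials once and reuses the connection, which is roughly what
// caching the connections (as in the PR mentioned above) amounts to.
func sharedConn(target string) (*grpc.ClientConn, error) {
	dialOnce.Do(func() {
		shared, dialErr = grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	})
	return shared, dialErr
}
```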
I've seen this before on the worker service. I'm not sure why we would recreate grpc.ClientConns, since we have a stub-level cache that should avoid creating new ones. The new cache should only make a difference if we connect to multiple services on the same endpoint, which we generally don't do. So there might also be an issue with the stub cache.
Seeing a similar behaviour with the history service: after a day of continuous usage, the history service uses 100% of the available CPU and keeps doing so until the pod is restarted, even when the number of active workflows is under 100. After a restart it comes back to normal for 2-3 hours before climbing again.