
Frontend Service - goroutine (CPU & Memory) Leak

Open uritig opened this issue 1 year ago • 9 comments

Expected Behavior

There should be no memory leak resulting from objects not being properly garbage collected.

Actual Behavior

The number of objects on the heap keeps growing. This appears to result in a slow increase in CPU & memory usage, eventually causing an outage.

Steps to Reproduce the Problem

  • Start server

temporal server start-dev --port 7233 --ui-port 8233 --metrics-port 9233

  • Do not start workflows, make any gRPC calls, use the SDKs, or open the Web UI; i.e. keep usage to a minimum.

  • Periodically check pprof & metrics. This shows that the goroutine count, the number of objects on the heap (i.e. live heap allocations), and memory allocations keep growing perpetually (see the polling sketch after this list).

(screenshot: pprof & metrics output)

  • runtime.MemStats from http://localhost:/debug/pprof/heap?debug=1
# runtime.MemStats
# Alloc = 65091800
# TotalAlloc = 318383929504
# Sys = 119521560
# Lookups = 0
# Mallocs = 1641446565
# Frees = 1641167975
# HeapAlloc = 65091800
# HeapSys = 82378752
# HeapIdle = 11141120
# HeapInuse = 71237632
# HeapReleased = 4464640
# HeapObjects = 278590
# Stack = 18284544 / 18284544
# MSpan = 825408 / 960048
# MCache = 19200 / 31200
# BuckHashSys = 2638467
# GCSys = 12561120
# OtherSys = 2667429
# NextGC = 69589984
# LastGC = 1721688648441657922
...
# NumGC = 10489
# NumForcedGC = 0
# GCCPUFraction = 4.835076125703521e-05
# DebugGC = false
# MaxRSS = 178909184

  • Flame graph

(flame graph screenshot)

  • This was observed across local, shared, and production environments. Please see the Prometheus chart below from a production environment, where the num_goroutine count kept increasing until a restart. Notice that the leak appears to be isolated to the frontend service; the other services look fine. The CPU usage & memory usage charts followed the same pattern.

(screenshot: Prometheus num_goroutine chart by service)
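To make the "periodically check pprof" step repeatable, here is a minimal Go sketch that polls the goroutine profile endpoint and logs the total count over time. The pprof address and sampling interval are assumptions to adjust for your deployment; the debug=1 goroutine profile begins with a "goroutine profile: total N" header line, which is all this sketch records.

// goroutine_poller.go - periodically samples the pprof goroutine count so the
// growth described above can be tracked over time.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"time"
)

// pprofURL is an assumption - point it at the frontend host/port exposing pprof.
const pprofURL = "http://localhost:6060/debug/pprof/goroutine?debug=1"

// goroutineTotal fetches the profile and returns its first line, e.g.
// "goroutine profile: total 4204".
func goroutineTotal() (string, error) {
	resp, err := http.Get(pprofURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	if scanner.Scan() {
		return scanner.Text(), nil
	}
	return "", fmt.Errorf("empty profile response")
}

func main() {
	ticker := time.NewTicker(5 * time.Minute) // assumed sampling interval
	defer ticker.Stop()
	for range ticker.C {
		line, err := goroutineTotal()
		if err != nil {
			log.Printf("poll failed: %v", err)
			continue
		}
		log.Println(line)
	}
}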

Specifications

This was observed across multiple versions

  • Server Version: 1.22.5, 1.22.7, 1.23.1, etc.
  • Platform: Linux
  • MTLS enabled
  • Auth disabled

Links

This issue is potentially related to https://community.temporal.io/t/high-cpu-usage-memory-leakage-on-frontend-service/4246/1

uritig avatar Jul 22 '24 23:07 uritig

I am also experiencing this issue.

clayzermk1 avatar Jul 23 '24 19:07 clayzermk1

ditto

TheHiddenLayer avatar Jul 26 '24 08:07 TheHiddenLayer

Hello, I took a brief look into this and was not able to reproduce it locally. I left the server running for over two days and didn't observe the goroutine count growing. Also, in the screenshot you provided (from http://localhost/debug/pprof/heap?debug=1) I don't see anything abnormal. Even an idle server has periodic internal activity, and it uses about 2800 goroutines. The heap and alloc counts you see are cumulative allocation totals (count and size), including allocations that have since been freed; they don't represent the currently in-use objects and their size.
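To illustrate that distinction with the numbers from the dump above: Mallocs and Frees are cumulative counters, and their difference is the live-object count (1,641,446,565 - 1,641,167,975 = 278,590, which matches HeapObjects). A minimal sketch of reading those fields directly via runtime.ReadMemStats:

// memstats_live.go - shows that HeapObjects equals Mallocs minus Frees,
// i.e. live objects, not the cumulative totals.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// Cumulative counters: these only ever grow, even on a healthy process.
	fmt.Printf("TotalAlloc (cumulative bytes): %d\n", m.TotalAlloc)
	fmt.Printf("Mallocs (cumulative count):    %d\n", m.Mallocs)
	fmt.Printf("Frees (cumulative count):      %d\n", m.Frees)

	// Live state: this is what should stay roughly flat on an idle server.
	fmt.Printf("Live objects (Mallocs-Frees):  %d\n", m.Mallocs-m.Frees)
	fmt.Printf("HeapObjects:                   %d\n", m.HeapObjects)
	fmt.Printf("HeapAlloc (live bytes):        %d\n", m.HeapAlloc)
}

On a healthy idle server the cumulative counters grow while the live values stay roughly flat; a genuine leak shows up as HeapObjects or the goroutine count trending upward over time.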

However, the last graph you linked, showing num_goroutines from a production environment, is a bit concerning. Since num_goroutines is reported as a gauge, I initially thought applying a sum across hosts could produce a graph like that. But seeing this happen only in the frontend and not in the other services is confusing, so we need to dig a little deeper. As a first step, could you please apply the avg or avg_over_time function to num_goroutines and plot the same graph? If you still see the same behavior, we need to connect to one of the servers and obtain the goroutine profile to see what is holding up all those goroutines. You can do this by port-forwarding to one of the frontend hosts, visiting http://localhost:your_port/debug/pprof, and clicking the "goroutines" link. Please let us know what you find.

gow avatar Aug 05 '24 22:08 gow

@gow thank you for the response. Please note that this issue still persists in production and (fortunately) in our sandbox environment as well. I'm about to enable pprof and leave it running for a couple of weeks. But first, the captured charts: (screenshots: avg(num_goroutine) graphs)

Will post back with the same 2 charts and results of the "goroutines" link in a couple of weeks.

uritig avatar Aug 08 '24 23:08 uritig

Thanks for posting the avg(num_goroutine) graphs, @guritinvda. To enable pprof, do you have to restart the service? If not, could you please connect to one of the frontend hosts in your sandbox environment, visit /debug/pprof, and get the goroutine dump in its current state?

gow avatar Aug 08 '24 23:08 gow

To enable pprof, do you have to restart the service?

Unfortunately, yes.

uritig avatar Aug 08 '24 23:08 uritig

@guritinvda any new findings?

gow avatar Aug 16 '24 19:08 gow

I was able to reproduce the issue in our sandbox environment. The blue arrow in the screenshots below marks when I enabled pprof and "restarted" the frontend service.

(screenshots: num_goroutine charts around the restart)

Here's the goroutine dump (2 debug levels): goroutine-1.txt goroutine-2.txt

uritig avatar Aug 19 '24 17:08 uritig

Hey, which Temporal server and SDK versions are you using?

abhishekhugetech avatar Aug 20 '24 18:08 abhishekhugetech

We observed this issue across multiple Temporal server versions: 1.22.5, 1.22.7, 1.23.1, etc. Go SDK version: v1.24.0 (the majority of the workers and clients use this). Python SDK version: 1.5.1.

uritig avatar Aug 28 '24 20:08 uritig

@uritig thanks for posting the data. Looking at the hung goroutines, the growth seems to come from improper grpc.ClientConn cleanup.

goroutine profile: total 4204
3444 @ 0x43e5ce 0x44ea05 0x971c75 0x471661
#	0x971c74	google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run+0x114
/home/runner/go/pkg/mod/google.golang.org/[email protected]/internal/grpcsync/callback_serializer.go:76

I found a similar issue on the grpc-go repo that points to unclosed grpc.ClientConn instances as the cause - https://github.com/grpc/grpc-go/issues/6413#issuecomment-1610047629

There is a PR (https://github.com/temporalio/temporal/pull/6441) in Temporal to cache these connections instead of creating them on demand. That change might fix this issue.
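For readers following along, the failure mode described in that grpc-go issue is straightforward to demonstrate: every grpc.ClientConn that is dialed and never closed keeps background goroutines alive, including the grpcsync.CallbackSerializer goroutine seen in the profile above. The sketch below is an illustration of that pattern only, not the Temporal frontend code path; the target address and connection count are arbitrary.

// conn_leak_demo.go - shows goroutines accumulating when grpc.ClientConn values
// are created repeatedly and never closed. Illustration only; this is not the
// Temporal frontend code path, just the pattern from the linked grpc-go issue.
package main

import (
	"fmt"
	"runtime"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	fmt.Println("goroutines before:", runtime.NumGoroutine())

	conns := make([]*grpc.ClientConn, 0, 100)
	for i := 0; i < 100; i++ {
		// grpc.Dial is lazy by default, so this succeeds even if nothing is
		// listening on the target; each conn still starts background goroutines.
		cc, err := grpc.Dial("localhost:7233",
			grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			panic(err)
		}
		conns = append(conns, cc)
	}
	fmt.Println("goroutines with 100 unclosed conns:", runtime.NumGoroutine())

	// Closing the connections lets their background goroutines exit.
	for _, cc := range conns {
		cc.Close()
	}
	time.Sleep(time.Second) // give the goroutines a moment to wind down
	fmt.Println("goroutines after Close:", runtime.NumGoroutine())
}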

gow avatar Aug 29 '24 19:08 gow

I'll follow up once that PR lands.

gow avatar Aug 29 '24 19:08 gow

I've seen this before on the worker service. I'm not sure why we would recreate grpc.ClientConns, since we have a stub-level cache that should avoid creating new ones. The new cache should only make a difference if we connect to multiple services on the same endpoint, which we generally don't do. So there might be an issue with the stub cache as well.
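For context on what a "stub-level cache" means here, the usual shape of such a cache is a map from target to a shared grpc.ClientConn guarded by a mutex, so callers reuse one connection per target instead of dialing on demand. The following is a hypothetical sketch of that pattern only, not Temporal's actual stub cache or the implementation in the PR linked above; a real deployment with mTLS enabled would pass TLS credentials rather than insecure ones.

// connCache is a hypothetical sketch of sharing one grpc.ClientConn per target.
// It is not Temporal's stub cache or the implementation in the linked PR.
package rpcutil

import (
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

type connCache struct {
	mu    sync.Mutex
	conns map[string]*grpc.ClientConn
}

func newConnCache() *connCache {
	return &connCache{conns: make(map[string]*grpc.ClientConn)}
}

// get returns the cached connection for target, dialing at most once per target.
func (c *connCache) get(target string) (*grpc.ClientConn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cc, ok := c.conns[target]; ok {
		return cc, nil
	}
	// Insecure credentials are a placeholder; swap in mTLS credentials in practice.
	cc, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	c.conns[target] = cc
	return cc, nil
}

// closeAll releases every cached connection, e.g. during service shutdown.
func (c *connCache) closeAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for target, cc := range c.conns {
		cc.Close()
		delete(c.conns, target)
	}
}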

dnr avatar Aug 29 '24 21:08 dnr

We're seeing similar behaviour with the history service: after a day of continuous usage, the history service uses 100% of the available CPU, and a pod restart brings it back to normal for only 2-3 hours. The pod keeps using 100% CPU until it is restarted, even when the number of active workflows is under 100.

gtejasvi avatar Mar 25 '25 16:03 gtejasvi