Bug: Increasing Pod Memory Usage for Push Service
What happened?
Server version 3.5.0. The push service's memory usage keeps increasing over time. Possible memory leak?
What did you expect to happen?
Stable memory usage.
How can we reproduce it (as minimally and precisely as possible)?
This is from Prometheus metrics, queried with:

```promql
sum (container_memory_working_set_bytes{image!="",pod_name=~"$Pod",namespace="$namespace"}) by (pod_name)
```
Anything else we need to know?
No response
This seems a bit unusual. @FGadvancer
Updates:
After integrating the push service with Pyroscope and running it for a week, I got these stats:
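For anyone who wants to reproduce the profiling, here is a minimal sketch of hooking a Go service up to Pyroscope with the grafana/pyroscope-go SDK; the application name and server address below are assumptions, not the actual deployment values:

```go
package main

import (
	"log"

	"github.com/grafana/pyroscope-go"
)

func main() {
	// Start continuous profiling before the service's own startup logic.
	// "openim.push" and the server address are placeholder values.
	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "openim.push",
		ServerAddress:   "http://pyroscope:4040",
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,
			pyroscope.ProfileInuseObjects,
			pyroscope.ProfileInuseSpace,
			pyroscope.ProfileAllocObjects,
			pyroscope.ProfileAllocSpace,
		},
	})
	if err != nil {
		log.Fatalf("start pyroscope: %v", err)
	}

	// ... start the push service as usual ...
}
```

The in-use object and space profiles are what make a steadily growing allocation site visible over a week-long run.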
It looks like the push service accumulates a lot of gRPC connections over time, so I checked the code:
```go
func (p *Pusher) k8sOnlinePush(ctx context.Context, msg *sdkws.MsgData, pushToUserIDs []string) (wsResults []*msggateway.SingleMsgToUserResults, err error) {
	// ...
	for host, userIds := range usersHost {
		tconn, _ := p.discov.GetConn(ctx, host)
		usersConns[tconn] = userIds
	}
	// ...
```
Every push goes through p.discov.GetConn here, so connections keep piling up and memory keeps growing.
I made a temporary workaround:
```go
var usersConns = make(map[*grpc.ClientConn][]string)
for host, userIds := range usersHost {
	//tconn, _ := p.discov.GetConn(ctx, host)
	//usersConns[tconn] = userIds
	if conn, ok := onlinePusherConnMap[host]; ok {
		log.ZDebug(ctx, "DEBUG reuse local conn", "host", host)
		usersConns[conn] = userIds
	} else {
		log.ZDebug(ctx, "DEBUG no valid local conn", "host", host)
		tconn, _ := p.discov.GetConn(ctx, host)
		usersConns[tconn] = userIds
		onlinePusherConnMu.Lock()
		//defer onlinePusherConnMu.Unlock()
		log.ZDebug(ctx, "DEBUG add to local conn", "host", host)
		onlinePusherConnMap[host] = tconn
		onlinePusherConnMu.Unlock()
	}
}
```
This reuses the cached connection when one already exists for the host, and only calls GetConn (and caches the result) on a miss. Not sure if this is a safe solution. WDYT?
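One concern with the workaround is that the map is read without holding the mutex while other pushes may be writing to it. If the caching approach is kept, a sketch like the one below guards both the lookup and the insert with the same lock and re-checks under the write lock. `connCache` and its methods are hypothetical names, not code from the repository, and it assumes GetConn has the `(ctx, host) (*grpc.ClientConn, error)` shape used in the snippet above:

```go
package push

import (
	"context"
	"sync"

	"google.golang.org/grpc"
)

// connCache is a hypothetical helper: it keeps one *grpc.ClientConn per
// gateway host and guards both lookups and inserts with the same mutex.
type connCache struct {
	mu    sync.RWMutex
	conns map[string]*grpc.ClientConn
}

func newConnCache() *connCache {
	return &connCache{conns: make(map[string]*grpc.ClientConn)}
}

// get returns the cached connection for host, dialing via getConn only on a
// miss. getConn is expected to behave like p.discov.GetConn above.
func (c *connCache) get(ctx context.Context, host string,
	getConn func(ctx context.Context, host string) (*grpc.ClientConn, error)) (*grpc.ClientConn, error) {
	c.mu.RLock()
	conn, ok := c.conns[host]
	c.mu.RUnlock()
	if ok {
		return conn, nil
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check under the write lock: another push may have dialed this
	// host while we were waiting for the lock.
	if conn, ok := c.conns[host]; ok {
		return conn, nil
	}
	conn, err := getConn(ctx, host)
	if err != nil {
		return nil, err
	}
	c.conns[host] = conn
	return conn, nil
}
```

k8sOnlinePush could then fetch connections through `cache.get` (wrapping p.discov.GetConn if its real signature differs). Note that neither this sketch nor the posted workaround evicts a cached connection when a gateway pod is replaced, which would still need handling.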
This issue has been fixed in release-v3.8; I recommend updating to the new version. If you run into any new issues, please reopen this issue or create a new one.