[Question] Why does memory usage increase over time when running Agones SDK Health checks in a Go gRPC server?
Hi team,
I have a question regarding memory usage in a Go-based gRPC server managed by Agones.
What we're trying to understand
- Is the steady increase in memory usage over time expected behavior when calling agones.dev.sdk.Health() every second?
- Or is there something wrong with the way we’re using the Agones SDK or gRPC client?
What I did
We deployed a lightweight gRPC server written in Go inside an Agones-managed Pod. The only thing the server does is call the Agones SDK’s Health() function once every second. No other game logic or load is running.
We observed the container’s memory usage over time using Datadog and found that memory increases linearly at a rate of approximately 1 to 3 MiB per hour.
When we commented out the Health() call, the memory growth stopped.
We also reproduced this same behavior in a C++ server calling the equivalent Agones SDK health function.
Why this matters
We intend to run these Pods continuously for several days or weeks, and this rate of memory growth is not sustainable. We'd like to know:
- Is this a known behavior of the Agones SDK or gRPC?
- Or should we change our usage pattern or perform some cleanup?
Environment
- Kubernetes: EKS 1.32
- Agones: v1.48
- Language: Go 1.23.10
- gRPC: v1.73.0
- SDK Usage: Go SDK (agones.dev/agones/sdks/go) and C++ SDK (sdks/cpp)
Any insights, recommendations, or confirmation of expected behavior would be greatly appreciated.
Thank you!
> Agones: v1.35
This came out in 2023 - mind upgrading to the latest and seeing if the issue still persists?
@markmandel Thank you for checking.
Sorry, I’ve updated my previous comment — I have already confirmed that the issue still persists on v1.48. I haven’t tested versions beyond v1.48 yet.
@markmandel I also tested with Agones v1.50, but the result was the same. I’ll include the relevant code and go.mod used during testing below — would appreciate it if you could take a look.
main.go
package main

import (
	"context"
	"log"
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"time"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	go func() {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				var m runtime.MemStats
				runtime.ReadMemStats(&m)
				log.Printf("Memory - "+
					"Alloc: %d KB, "+
					"TotalAlloc: %d KB, "+
					"Sys: %d KB, "+
					"NumGC: %d",
					m.Alloc/1024,
					m.TotalAlloc/1024,
					m.Sys/1024,
					m.NumGC,
				)
			}
		}
	}()

	log.Println("Starting Agones Game Server...")
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("Could not connect to sdk: %v", err)
	}
	log.Println("SDK initialized successfully")

	err = s.Ready()
	if err != nil {
		log.Fatalf("Could not send ready message: %v", err)
	}
	log.Println("Game Server is Ready")

	ctx := context.Background()
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	log.Println("Starting health check loop...")
	for {
		select {
		case <-ticker.C:
			err := s.Health()
			if err != nil {
				log.Printf("Health check failed: %v", err)
			} else {
				log.Println("Health check sent successfully")
			}
		case <-ctx.Done():
			log.Println("Context cancelled, shutting down...")
			return
		}
	}
}
go.mod
module server
go 1.24
toolchain go1.24.4
require agones.dev/agones v1.50.0
require (
	github.com/grpc-ecosystem/grpc-gateway/v2 v2.26.3 // indirect
	github.com/pkg/errors v0.9.1 // indirect
	golang.org/x/net v0.41.0 // indirect
	golang.org/x/sys v0.33.0 // indirect
	golang.org/x/text v0.26.0 // indirect
	google.golang.org/genproto/googleapis/api v0.0.0-20250603155806-513f23925822 // indirect
	google.golang.org/genproto/googleapis/rpc v0.0.0-20250603155806-513f23925822 // indirect
	google.golang.org/grpc v1.73.0 // indirect
	google.golang.org/protobuf v1.36.6 // indirect
)
Memory stats log (from runtime.ReadMemStats)
2025/06/23 05:38:36 Memory - Alloc: 2074 KB, TotalAlloc: 4275 KB, Sys: 12885 KB, NumGC: 17
2025/06/23 05:38:46 Memory - Alloc: 2079 KB, TotalAlloc: 4280 KB, Sys: 12885 KB, NumGC: 17
2025/06/23 05:38:56 Memory - Alloc: 2052 KB, TotalAlloc: 4286 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:06 Memory - Alloc: 2057 KB, TotalAlloc: 4290 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:16 Memory - Alloc: 2062 KB, TotalAlloc: 4295 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:26 Memory - Alloc: 2067 KB, TotalAlloc: 4300 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:36 Memory - Alloc: 2072 KB, TotalAlloc: 4305 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:46 Memory - Alloc: 2076 KB, TotalAlloc: 4310 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:56 Memory - Alloc: 2081 KB, TotalAlloc: 4315 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:06 Memory - Alloc: 2086 KB, TotalAlloc: 4319 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:16 Memory - Alloc: 2091 KB, TotalAlloc: 4324 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:26 Memory - Alloc: 2096 KB, TotalAlloc: 4329 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:36 Memory - Alloc: 2101 KB, TotalAlloc: 4334 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:46 Memory - Alloc: 2105 KB, TotalAlloc: 4339 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:56 Memory - Alloc: 2079 KB, TotalAlloc: 4344 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:06 Memory - Alloc: 2084 KB, TotalAlloc: 4349 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:16 Memory - Alloc: 2088 KB, TotalAlloc: 4354 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:26 Memory - Alloc: 2093 KB, TotalAlloc: 4359 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:36 Memory - Alloc: 2098 KB, TotalAlloc: 4364 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:46 Memory - Alloc: 2103 KB, TotalAlloc: 4369 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:56 Memory - Alloc: 2108 KB, TotalAlloc: 4373 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:06 Memory - Alloc: 2113 KB, TotalAlloc: 4378 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:16 Memory - Alloc: 2118 KB, TotalAlloc: 4384 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:26 Memory - Alloc: 2123 KB, TotalAlloc: 4388 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:36 Memory - Alloc: 2128 KB, TotalAlloc: 4393 KB, Sys: 12885 KB, NumGC: 19
Wait, the issue is with your implementation code, not with the sdkserver container? (This is what I'm seeing.)
Small increases in memory over time may not be an issue, as long as it GC's effectively. The SDKs are the lightest of wrappers around a gRPC client.
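To make "lightest of wrappers" concrete, here is a rough sketch of a health ping sent directly over a gRPC client stream. This is an illustration, not the actual SDK source; the agones.dev/agones/pkg/sdk import path for the generated bindings is an assumption on my part (the default sidecar port 9357 matches the C++ SDK shown later in this thread).

```go
// Illustration only: roughly what a thin Health() wrapper amounts to.
// The generated-bindings import path is assumed for this sketch.
package main

import (
	"context"
	"log"

	pb "agones.dev/agones/pkg/sdk"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.NewClient("localhost:9357",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	client := pb.NewSDKClient(conn)

	// One long-lived client stream; each health ping is just a Send on it.
	stream, err := client.Health(context.Background())
	if err != nil {
		log.Fatalf("could not open health stream: %v", err)
	}
	if err := stream.Send(&pb.Empty{}); err != nil {
		log.Printf("health send failed: %v", err)
	}
}
```

The per-call work is a single small protobuf message on an already-open stream.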
I just ran a test with the simple game server example, which does a very similar thing:
This is the start.
After 12 hours, seems pretty stable at 33.1
After another 8 hours or so, it's up to 36.1
So it's quite possible there's a small memory leak in there.
Looking at gRPC, I am wondering if there's a small memory leak in the client code (there really isn't much to the SDK): https://github.com/grpc/grpc/issues/38327
Probably the only way to tell is to do a proper pprof and see where the memory map is storing data. It does seem strange that you're seeing it across both C++ and Go. (Go, I think, is a pure implementation, not a C wrapper.)
Hi @markmandel ,
Thank you very much for your thoughtful response.
I also believe that the memory growth is not coming from the sdkserver container but rather from the Go application itself — and more specifically, it’s likely related to the gRPC client behavior.
To that end, I’ve opened a separate issue here as well: grpc/grpc-go#8403. I’m hoping to get more clarity by combining insights from both discussions.
I appreciate your suggestion to analyze the memory usage with pprof. I’ll begin a more detailed investigation using it. However, I don’t have much experience with pprof, so if you’re able to assist or guide me through any part of the process, that would be incredibly helpful.
In the meantime, I’ll also look up best practices and try to proceed on my own.
Thanks again for your support!
pprof is remarkably easy to use:
- https://pkg.go.dev/runtime/pprof
- https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/ is a nice guide as well.
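If it helps, here is a minimal sketch of capturing comparable heap snapshots from inside the process; the file names, the 30-minute interval, and the dumpHeapProfiles helper are arbitrary choices for illustration, not anything Agones-specific.

```go
// Minimal sketch: periodically write heap profiles to disk so that two
// snapshots taken hours apart can be diffed offline.
package main

import (
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapProfiles is a hypothetical helper for this sketch.
func dumpHeapProfiles(interval time.Duration) {
	for i := 0; ; i++ {
		time.Sleep(interval)
		runtime.GC() // run a GC first so the snapshot reflects live objects
		f, err := os.Create(fmt.Sprintf("heap-%03d.pprof", i))
		if err != nil {
			log.Printf("could not create heap profile file: %v", err)
			continue
		}
		if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
			log.Printf("could not write heap profile: %v", err)
		}
		f.Close()
	}
}

func main() {
	go dumpHeapProfiles(30 * time.Minute)
	select {} // stand-in for the game server's main loop
}
```

Two snapshots can then be compared with `go tool pprof -base heap-000.pprof heap-020.pprof` (then `top` at the interactive prompt). Since your main.go already imports net/http/pprof, running `go tool pprof http://localhost:6060/debug/pprof/heap` against the running Pod works as well.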
Hi @markmandel ,
Thank you again for your previous comment and for sharing those helpful resources.
I'm sorry, but I haven't tried pprof yet. However, I ran a long-duration test using a lightweight Go-based gRPC server with Agones, and as shown by the blue line in the attached image (representing the Go process), memory growth did not occur — likely thanks to Go’s garbage collector.
On the other hand, when testing with a C++ gRPC server, which doesn't have a GC, memory usage kept increasing over time. To investigate further, I modified the C++ SDK in two places marked with #ifdef MODIFY, reimplementing the Health() method as a short-lived stream that is opened on the existing channel and finished on each call.
After this change, memory usage stopped increasing.
It seems this change helps avoid memory accumulation, but I’m not sure if this approach aligns with Agones’ intended design. Would this one-off streaming approach for Health() be considered acceptable?
Apologies for asking again, and I really appreciate your time and support.
sdks/cpp/src/agones/sdk.cc
// Copyright 2017 Google LLC All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "agones/sdk.h"
#include <grpcpp/grpcpp.h>
#include <utility>
namespace agones {
struct SDK::SDKImpl {
  std::string host_;
  std::shared_ptr<grpc::Channel> channel_;
  std::unique_ptr<agones::dev::sdk::SDK::Stub> stub_;
  std::unique_ptr<grpc::ClientWriter<agones::dev::sdk::Empty>> health_;
  std::unique_ptr<grpc::ClientContext> health_context_;
};

SDK::SDK() : pimpl_{std::make_unique<SDKImpl>()} {
  const char* port = std::getenv("AGONES_SDK_GRPC_PORT");
  pimpl_->host_ = std::string("localhost:") + (port ? port : "9357");
  pimpl_->channel_ =
      grpc::CreateChannel(pimpl_->host_, grpc::InsecureChannelCredentials());
}

SDK::~SDK() {}

bool SDK::Connect() {
  if (!pimpl_->channel_->WaitForConnected(
          gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                       gpr_time_from_seconds(30, GPR_TIMESPAN)))) {
    std::cerr << "Could not connect to the sidecar at " << pimpl_->host_
              << ".\n";
    return false;
  }
  pimpl_->stub_ = agones::dev::sdk::SDK::NewStub(pimpl_->channel_);
#ifdef MODIFY
  // nothing
#else
  // Make the health connection.
  agones::dev::sdk::Empty response;
  pimpl_->health_context_ =
      std::unique_ptr<grpc::ClientContext>(new grpc::ClientContext);
  pimpl_->health_ = pimpl_->stub_->Health(&*pimpl_->health_context_, &response);
#endif
  return true;
}

grpc::Status SDK::Ready() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->Ready(&context, request, &response);
}

grpc::Status SDK::Allocate() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->Allocate(&context, request, &response);
}

grpc::Status SDK::Reserve(std::chrono::seconds seconds) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Duration request;
  request.set_seconds(seconds.count());
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->Reserve(&context, request, &response);
}

bool SDK::Health() {
  agones::dev::sdk::Empty request;
#ifdef MODIFY
  agones::dev::sdk::Empty response;
  grpc::ClientContext context;
  auto stream = pimpl_->stub_->Health(&context, &response);
  // send request
  if (!stream->Write(request)) {
    std::cerr << "Failed to write health request";
    return false;
  }
  // write end
  stream->WritesDone();
  // get result
  grpc::Status status = stream->Finish();
  if (!status.ok()) {
    std::cerr << "Health check failed: " << status.error_message() << "\n";
    return false;
  }
  return true;
#else
  return pimpl_->health_->Write(request);
#endif
}

grpc::Status SDK::GameServer(agones::dev::sdk::GameServer* response) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  return pimpl_->stub_->GetGameServer(&context, request, response);
}

grpc::Status SDK::WatchGameServer(
    const std::function<void(const agones::dev::sdk::GameServer&)>& callback) {
  agones::dev::sdk::Empty request;
  agones::dev::sdk::GameServer gameServer;
  std::unique_ptr<grpc::ClientReader<agones::dev::sdk::GameServer>> reader =
      pimpl_->stub_->WatchGameServer(&watch_gs_context_, request);
  while (reader->Read(&gameServer)) {
    callback(gameServer);
  }
  return reader->Finish();
}

void SDK::CancelWatchGameServer() {
  watch_gs_context_.TryCancel();
}

grpc::Status SDK::Shutdown() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->Shutdown(&context, request, &response);
}

grpc::Status SDK::SetLabel(std::string key, std::string value) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::KeyValue request;
  request.set_key(std::move(key));
  request.set_value(std::move(value));
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->SetLabel(&context, request, &response);
}

grpc::Status SDK::SetAnnotation(std::string key, std::string value) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::KeyValue request;
  request.set_key(std::move(key));
  request.set_value(std::move(value));
  agones::dev::sdk::Empty response;
  return pimpl_->stub_->SetAnnotation(&context, request, &response);
}
} // namespace agones
The other option - try rolling forward gRPC versions and see if the problem goes away. Assuming it has been fixed - if you can identify which version provides the fix, we can lock to that version in the next release as well.
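As a sketch of what that could look like in go.mod (the gRPC version below is a placeholder for whichever newer release is being tested, not a known-good version), promoting gRPC from an indirect to a direct requirement makes the version under test explicit:

```
module server

go 1.24

require (
	agones.dev/agones v1.50.0
	google.golang.org/grpc v1.74.0 // placeholder: pin the release under test
)
```

After bumping, run `go mod tidy` and repeat the same soak test to compare the memory curve.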
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
Keeping this open until we validate it's been fixed. I expect an updated grpc will solve this. If we can identify which grpc version is good, we can hardcode it until the Kubernetes updates upgrade us to it.