
[Question] Why does memory usage increase over time when running Agones SDK Health checks in a Go gRPC server?

Open galileo1721 opened this issue 6 months ago • 11 comments

Hi team,

I have a question regarding memory usage in a Go-based gRPC server managed by Agones.

What am I trying to understand?

  • Is the steady increase in memory usage over time expected behavior when calling agones.dev.sdk.Health() every second?
  • Or is there something wrong with the way we’re using the Agones SDK or gRPC client?

What I did

We deployed a lightweight gRPC server written in Go inside an Agones-managed Pod. The only thing the server does is call the Agones SDK’s Health() function once every second. No other game logic or load is running.

We observed the container’s memory usage over time using Datadog and found that memory increases linearly at a rate of approximately 1 to 3 MiB per hour.

When we commented out the Health() call, the memory growth stopped.

We also reproduced this same behavior in a C++ server calling the equivalent Agones SDK health function.

Why this matters

We intend to run these Pods continuously for several days or weeks, and this rate of memory growth is not sustainable. We'd like to know:

  • Is this a known behavior of the Agones SDK or gRPC?
  • Or should we change our usage pattern or perform some cleanup?
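
To put the reported rate in context, here is a small arithmetic sketch. It assumes the growth stays linear at the observed 1 to 3 MiB per hour, which the measurements so far suggest but do not prove; `projectedGrowthMiB` is a hypothetical helper name used only for illustration.

```go
package main

import "fmt"

// projectedGrowthMiB returns the extra memory, in MiB, accumulated after
// the given number of hours at a constant growth rate (MiB per hour).
// Assumption: growth is linear, as observed in the Datadog graphs.
func projectedGrowthMiB(ratePerHour, hours float64) float64 {
	return ratePerHour * hours
}

func main() {
	// Rates reported above: roughly 1 to 3 MiB per hour.
	for _, rate := range []float64{1, 3} {
		day := projectedGrowthMiB(rate, 24)
		week := projectedGrowthMiB(rate, 24*7)
		fmt.Printf("%.0f MiB/h -> %.0f MiB/day, %.0f MiB/week\n", rate, day, week)
	}
}
```

At the upper end that is about half a GiB per week for a process that does nothing but send health pings.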

Environment

  • Kubernetes: EKS 1.32
  • Agones: v1.48
  • Language: Go 1.23.10
  • gRPC: v1.73.0
  • SDK Usage: agones.dev/sdk-go and agones.dev/sdk-cpp

Any insights, recommendations, or confirmation of expected behavior would be greatly appreciated.

Thank you!

galileo1721 avatar Jun 22 '25 23:06 galileo1721

Agones: v1.35

This came out in 2023 - mind upgrading to the latest and seeing if the issue still persists?

markmandel avatar Jun 23 '25 00:06 markmandel

@markmandel Thank you for checking.

Sorry, I’ve updated my previous comment — I have already confirmed that the issue still persists on v1.48. I haven’t tested versions beyond v1.48 yet.

galileo1721 avatar Jun 23 '25 01:06 galileo1721

@markmandel I also tested with Agones v1.50, but the result was the same. I’ll include the relevant code and go.mod used during testing below — would appreciate it if you could take a look.

main.go


package main

import (
	"context"
	"log"
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"time"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	go func() {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()

		for {
			select {
			case <-ticker.C:
				var m runtime.MemStats
				runtime.ReadMemStats(&m)
				log.Printf("Memory - "+
					"Alloc: %d KB, "+
					"TotalAlloc: %d KB, "+
					"Sys: %d KB, "+
					"NumGC: %d",
					m.Alloc/1024,
					m.TotalAlloc/1024,
					m.Sys/1024,
					m.NumGC,
				)
			}
		}
	}()

	log.Println("Starting Agones Game Server...")

	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("Could not connect to sdk: %v", err)
	}

	log.Println("SDK initialized successfully")

	err = s.Ready()
	if err != nil {
		log.Fatalf("Could not send ready message: %v", err)
	}
	log.Println("Game Server is Ready")

	ctx := context.Background()

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	log.Println("Starting health check loop...")

	for {
		select {
		case <-ticker.C:
			err := s.Health()
			if err != nil {
				log.Printf("Health check failed: %v", err)
			} else {
				log.Println("Health check sent successfully")
			}

		case <-ctx.Done():
			log.Println("Context cancelled, shutting down...")
			return
		}
	}
}

go.mod


module server

go 1.24

toolchain go1.24.4

require agones.dev/agones v1.50.0

require (
	github.com/grpc-ecosystem/grpc-gateway/v2 v2.26.3 // indirect
	github.com/pkg/errors v0.9.1 // indirect
	golang.org/x/net v0.41.0 // indirect
	golang.org/x/sys v0.33.0 // indirect
	golang.org/x/text v0.26.0 // indirect
	google.golang.org/genproto/googleapis/api v0.0.0-20250603155806-513f23925822 // indirect
	google.golang.org/genproto/googleapis/rpc v0.0.0-20250603155806-513f23925822 // indirect
	google.golang.org/grpc v1.73.0 // indirect
	google.golang.org/protobuf v1.36.6 // indirect
)

Memory log (runtime.MemStats output, sampled every 10 seconds)


2025/06/23 05:38:36 Memory - Alloc: 2074 KB, TotalAlloc: 4275 KB, Sys: 12885 KB, NumGC: 17
2025/06/23 05:38:46 Memory - Alloc: 2079 KB, TotalAlloc: 4280 KB, Sys: 12885 KB, NumGC: 17
2025/06/23 05:38:56 Memory - Alloc: 2052 KB, TotalAlloc: 4286 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:06 Memory - Alloc: 2057 KB, TotalAlloc: 4290 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:16 Memory - Alloc: 2062 KB, TotalAlloc: 4295 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:26 Memory - Alloc: 2067 KB, TotalAlloc: 4300 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:36 Memory - Alloc: 2072 KB, TotalAlloc: 4305 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:46 Memory - Alloc: 2076 KB, TotalAlloc: 4310 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:39:56 Memory - Alloc: 2081 KB, TotalAlloc: 4315 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:06 Memory - Alloc: 2086 KB, TotalAlloc: 4319 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:16 Memory - Alloc: 2091 KB, TotalAlloc: 4324 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:26 Memory - Alloc: 2096 KB, TotalAlloc: 4329 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:36 Memory - Alloc: 2101 KB, TotalAlloc: 4334 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:46 Memory - Alloc: 2105 KB, TotalAlloc: 4339 KB, Sys: 12885 KB, NumGC: 18
2025/06/23 05:40:56 Memory - Alloc: 2079 KB, TotalAlloc: 4344 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:06 Memory - Alloc: 2084 KB, TotalAlloc: 4349 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:16 Memory - Alloc: 2088 KB, TotalAlloc: 4354 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:26 Memory - Alloc: 2093 KB, TotalAlloc: 4359 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:36 Memory - Alloc: 2098 KB, TotalAlloc: 4364 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:46 Memory - Alloc: 2103 KB, TotalAlloc: 4369 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:41:56 Memory - Alloc: 2108 KB, TotalAlloc: 4373 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:06 Memory - Alloc: 2113 KB, TotalAlloc: 4378 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:16 Memory - Alloc: 2118 KB, TotalAlloc: 4384 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:26 Memory - Alloc: 2123 KB, TotalAlloc: 4388 KB, Sys: 12885 KB, NumGC: 19
2025/06/23 05:42:36 Memory - Alloc: 2128 KB, TotalAlloc: 4393 KB, Sys: 12885 KB, NumGC: 19
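
A note on reading a log like this one: Alloc climbs by about 5 KB per sample and drops back each time NumGC increments (2079 to 2052 KB at 05:38:56, 2105 to 2079 KB at 05:40:56), so the number that matters for a leak is the floor after each collection rather than the sawtooth in between. Two post-GC samples are too few to conclude anything. The sketch below shows one possible way to sample that floor directly by forcing a collection before reading the stats. `liveHeapKB` and `retained` are hypothetical names, and the allocations in `main` only simulate garbage versus retained memory.

```go
package main

import (
	"fmt"
	"runtime"
)

// liveHeapKB forces a garbage collection and then reports the heap bytes
// still allocated, in KB. Comparing this value over time separates retained
// memory from garbage that is merely waiting to be collected.
func liveHeapKB() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc / 1024
}

// retained is a hypothetical stand-in for memory that a leak would keep alive.
var retained [][]byte

func main() {
	before := liveHeapKB()

	// Allocate 8 MiB that becomes garbage immediately; it should not
	// show up in the post-GC number.
	for i := 0; i < 8; i++ {
		_ = make([]byte, 1<<20)
	}
	afterGarbage := liveHeapKB()

	// Allocate 8 MiB that stays reachable; this should show up.
	for i := 0; i < 8; i++ {
		retained = append(retained, make([]byte, 1<<20))
	}
	afterRetained := liveHeapKB()

	fmt.Printf("live heap: start=%d KB, after garbage=%d KB, after retained=%d KB\n",
		before, afterGarbage, afterRetained)
}
```

Logging `liveHeapKB()` every few minutes in the health-check server, instead of raw Alloc, should make it easier to tell whether the Go process itself is retaining memory. Forcing a GC this often is only appropriate for a diagnostic build.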

galileo1721 avatar Jun 23 '25 04:06 galileo1721

Wait, the issue is with your implementing code, not with the sdkserver container? (This is what I'm seeing.)

Small increases in memory over time may not be an issue, as long as the memory is garbage-collected effectively. The SDKs are the lightest of wrappers around a gRPC client.

I just ran a test with the simple game server example, that does a very similar thing:

This is the start.

Image

After 12 hours, seems pretty stable at 33.1

Image

After another 8 hours or so, it's up to 36.1

Image

So it's quite possible there's a small memory leak in there.

Looking at gRPC, I am wondering if there's a small memory leak in the client code (there really isn't much to the SDK): https://github.com/grpc/grpc/issues/38327

markmandel avatar Jun 23 '25 23:06 markmandel

Probably the only way to tell is to do a proper pprof run and see where the memory is being held. It does seem strange that you're seeing it across both C++ and Go. (The Go implementation, I think, is pure Go, not a C wrapper.)

markmandel avatar Jun 23 '25 23:06 markmandel

Hi @markmandel ,

Thank you very much for your thoughtful response.

I also believe that the memory growth is not coming from the sdkserver container but rather from the Go application itself — and more specifically, it’s likely related to the gRPC client behavior.

To that end, I’ve opened a separate issue here as well: grpc/grpc-go#8403. I’m hoping to get more clarity by combining insights from both discussions.

I appreciate your suggestion to analyze the memory usage with pprof. I’ll begin a more detailed investigation using it. However, I don’t have much experience with pprof, so if you’re able to assist or guide me through any part of the process, that would be incredibly helpful.

In the meantime, I’ll also look up best practices and try to proceed on my own.

Thanks again for your support!

galileo1721 avatar Jun 23 '25 23:06 galileo1721

pprof is remarkably easy to use:

  • https://pkg.go.dev/runtime/pprof
  • https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/ is a nice guide as well.

markmandel avatar Jun 24 '25 00:06 markmandel

Hi @markmandel ,

Thank you again for your previous comment and for sharing those helpful resources.

I'm sorry, but I haven't tried pprof yet. However, I ran a long-duration test using a lightweight Go-based gRPC server with Agones, and as shown by the blue line in the attached image (representing the Go process), memory growth did not occur — likely thanks to Go’s garbage collector.

On the other hand, when testing with a C++ gRPC server, which doesn’t have a GC, memory usage kept increasing over time. To investigate further, I modified the C++ SDK in two places marked with #ifdef MODIFY, reimplementing the Health() method as a short-lived stream that creates a new connection on each call.

After this change, memory usage stopped increasing.

It seems this change helps avoid memory accumulation, but I’m not sure if this approach aligns with Agones’ intended design. Would this one-off streaming approach for Health() be considered acceptable?

Apologies for asking again, and I really appreciate your time and support.

Image

sdks/cpp/src/agones/sdk.cc

// Copyright 2017 Google LLC All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#include "agones/sdk.h"

#include <grpcpp/grpcpp.h>

#include <utility>

namespace agones {

struct SDK::SDKImpl {
  std::string host_;
  std::shared_ptr<grpc::Channel> channel_;
  std::unique_ptr<agones::dev::sdk::SDK::Stub> stub_;
  std::unique_ptr<grpc::ClientWriter<agones::dev::sdk::Empty>> health_;
  std::unique_ptr<grpc::ClientContext> health_context_;
};

SDK::SDK() : pimpl_{std::make_unique<SDKImpl>()} {
  const char* port = std::getenv("AGONES_SDK_GRPC_PORT");
  pimpl_->host_ = std::string("localhost:") + (port ? port : "9357");
  pimpl_->channel_ =
      grpc::CreateChannel(pimpl_->host_, grpc::InsecureChannelCredentials());
}

SDK::~SDK() {}

bool SDK::Connect() {
  if (!pimpl_->channel_->WaitForConnected(
          gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                       gpr_time_from_seconds(30, GPR_TIMESPAN)))) {
    std::cerr << "Could not connect to the sidecar at " << pimpl_->host_
              << ".\n";
    return false;
  }

  pimpl_->stub_ = agones::dev::sdk::SDK::NewStub(pimpl_->channel_);

#ifdef MODIFY
  // nothing
#else
  // Make the health connection.
  agones::dev::sdk::Empty response;
  pimpl_->health_context_ =
      std::unique_ptr<grpc::ClientContext>(new grpc::ClientContext);
  pimpl_->health_ = pimpl_->stub_->Health(&*pimpl_->health_context_, &response);
#endif

  return true;
}

grpc::Status SDK::Ready() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;

  return pimpl_->stub_->Ready(&context, request, &response);
}

grpc::Status SDK::Allocate() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;

  return pimpl_->stub_->Allocate(&context, request, &response);
}

grpc::Status SDK::Reserve(std::chrono::seconds seconds) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));

  agones::dev::sdk::Duration request;
  request.set_seconds(seconds.count());

  agones::dev::sdk::Empty response;

  return pimpl_->stub_->Reserve(&context, request, &response);
}

bool SDK::Health() {
  agones::dev::sdk::Empty request;

#ifdef MODIFY
  agones::dev::sdk::Empty response;
  grpc::ClientContext context;
  auto stream = pimpl_->stub_->Health(&context, &response);
  // send request
  if (!stream->Write(request)) {
    std::cerr << "Failed to write health request\n";
    return false;
  }
  
  // write end
  stream->WritesDone();
  // get result
  grpc::Status status = stream->Finish();
  if (!status.ok()) {
    std::cerr << "Health check failed: " << status.error_message() << "\n";
    return false;
  }
  return true;
#else
  return pimpl_->health_->Write(request);
#endif
}

grpc::Status SDK::GameServer(agones::dev::sdk::GameServer* response) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;

  return pimpl_->stub_->GetGameServer(&context, request, response);
}

grpc::Status SDK::WatchGameServer(
    const std::function<void(const agones::dev::sdk::GameServer&)>& callback) {
  agones::dev::sdk::Empty request;
  agones::dev::sdk::GameServer gameServer;

  std::unique_ptr<grpc::ClientReader<agones::dev::sdk::GameServer>> reader =
      pimpl_->stub_->WatchGameServer(&watch_gs_context_, request);
  while (reader->Read(&gameServer)) {
    callback(gameServer);
  }
  return reader->Finish();
}

void SDK::CancelWatchGameServer()
{
  watch_gs_context_.TryCancel();
}

grpc::Status SDK::Shutdown() {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));
  agones::dev::sdk::Empty request;
  agones::dev::sdk::Empty response;

  return pimpl_->stub_->Shutdown(&context, request, &response);
}

grpc::Status SDK::SetLabel(std::string key, std::string value) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));

  agones::dev::sdk::KeyValue request;
  request.set_key(std::move(key));
  request.set_value(std::move(value));

  agones::dev::sdk::Empty response;

  return pimpl_->stub_->SetLabel(&context, request, &response);
}

grpc::Status SDK::SetAnnotation(std::string key, std::string value) {
  grpc::ClientContext context;
  context.set_deadline(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME),
                                    gpr_time_from_seconds(30, GPR_TIMESPAN)));

  agones::dev::sdk::KeyValue request;
  request.set_key(std::move(key));
  request.set_value(std::move(value));

  agones::dev::sdk::Empty response;

  return pimpl_->stub_->SetAnnotation(&context, request, &response);
}
}  // namespace agones

galileo1721 avatar Jul 04 '25 02:07 galileo1721

The other option - try rolling forward gRPC versions and see if the problem goes away. Assuming it has been fixed - if you can identify which version provides the fix, we can lock to that version in the next release as well.

markmandel avatar Jul 04 '25 02:07 markmandel

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

github-actions[bot] avatar Aug 15 '25 10:08 github-actions[bot]

Keeping this open until we validate it's been fixed. I expect an updated gRPC will solve this. If we can identify which gRPC version is good, we can pin to it until the Kubernetes updates upgrade us to it.

markmandel avatar Sep 01 '25 22:09 markmandel