
Some memory allocated by the receiver doesn't get freed

Open · antoineco opened this issue 3 years ago · 10 comments

SDK v2.3.1

Yesterday, I performed 2 consecutive load tests at respectively 100,000 and 120,000 events/sec against the SDK's default client.

Conditions:

  • 20 GCE n1-standard-4 nodes (2 cores, 6 GiB memory)
  • 20 replicas (1/node) of a neutral CloudEvent receiver (DefaultClient, func() {} callback)
  • 2 KiB fixed payload per event

I noticed 2 things:

  1. The memory usage increased at the beginning of the first load test from 3 MiB to 7 MiB (expected), then remained stable for the 20 min until the second test (unexpected).
  2. The memory usage jumped suddenly from 8-15 MiB to 100-103 MiB at the end of the test, although the load was constant.

  • I'm wondering whether the plateau in 1. is expected; I assumed the memory usage would eventually return to 3 MiB.

  • Although 2. can be explained by some sudden CPU throttling upstream in my Istio gateways, we can still observe a plateau at 75-80 MiB after the load test completed.

I haven't investigated yet, and the situation doesn't seem alarming to me, but I wanted to draw some attention to it anyway.


[image]

Fig. receiver pods - CPU and memory profile

[image]

Fig. Istio gateways - CPU throttling

antoineco avatar Dec 04 '20 11:12 antoineco

I observe a similar behaviour in Knative with the IMC dispatcher, whose memory usage remains high indefinitely after a load test.

[image]

antoineco avatar Dec 07 '20 18:12 antoineco

@antoineco this seems like a serious issue. It could be that, in the long run, the memory allocated by the memory pools is never really freed... Can you provide a profile of the memory usage with the Go pprof tooling?

slinkydeveloper avatar Dec 09 '20 07:12 slinkydeveloper

That was the next step but I got carried away with other things. I'll load test my receiver one more time and upload the profile in a few hours.

antoineco avatar Dec 09 '20 08:12 antoineco

Thanks!

Also upload a CPU profile (but don't acquire both in the same run).

slinkydeveloper avatar Dec 09 '20 08:12 slinkydeveloper
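
For reference, a minimal sketch of how the receiver could expose the requested profiles, assuming the standard net/http/pprof package (the SDK receiver wiring itself is elided here):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a side port, next to the CloudEvents receiver.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the CloudEvents receiver here ...
	select {}
}
```

With that in place, the heap profile can be pulled with `go tool pprof http://localhost:6060/debug/pprof/heap`, and a 30-second CPU profile, in a separate run as noted above, with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.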

OK, enough time has elapsed since my test (12,000 events/sec for 3 min).

Here is the graph showing the CPU and memory usage. I captured a profile at the very end.

[image]

Trace captured during the load test:

[flamegraph attachment]
I see only ~10 MiB allocated on the heap. The 45 MiB reported by Prometheus comes from this metric: container_memory_working_set_bytes (hi, Brendan Gregg).

container_memory_working_set_bytes{namespace="perf-receiver", pod="$pod", container="receiver"}

Side note:

Here are all cAdvisor's memory-related metrics in a single stacked graph, for comparison:

[image]

The same metrics when the receiver doesn't process any request, right after the process starts:

[image]

antoineco avatar Dec 09 '20 12:12 antoineco
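
One way to narrow down where the remaining ~35 MiB sits is to compare what the Go runtime itself accounts for against the cgroup working set. A minimal sketch, assuming a periodic log line is acceptable (the interval and log format are illustrative):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Periodically log how the Go runtime accounts for its own memory:
	//   HeapInuse    - heap spans with live objects
	//   HeapIdle     - heap retained by the runtime but currently unused
	//   HeapReleased - portion of HeapIdle already returned to the OS
	//   Sys          - total memory obtained from the OS by the runtime
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%dMiB heap_idle=%dMiB heap_released=%dMiB sys=%dMiB",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
	}
}
```

If HeapIdle minus HeapReleased roughly matches the gap between the heap profile and the working set, the memory is sitting in the runtime's free lists waiting to be scavenged, rather than being referenced by the SDK.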

OK, so I can't really speak to the memory accounted to the container, but as far as the process itself goes, 10 MiB is definitely a reasonable amount of memory to keep the runtime and the pools "warm". I don't expect this memory to get deallocated while the application is running.

The flamegraphs definitely look reasonable too... I might be missing the point here.

slinkydeveloper avatar Dec 09 '20 14:12 slinkydeveloper

I created the issue before investigating. If the behaviour is normal, there is no point to miss and we can close this. The analysis and conclusion may still benefit someone else, who knows, especially since this can be observed in anything that uses the Go SDK.

antoineco avatar Dec 09 '20 15:12 antoineco

I think 10 MiB is quite normal for any Go application that uses some sort of pooling, and I'd guess even for ones that don't... I even think it's on the low side :smile:

slinkydeveloper avatar Dec 09 '20 16:12 slinkydeveloper

My concern was not about the 10 MiB, this is absolutely reasonable.

My concern was about the value of the WSS, which starts at < 4 MiB, increases under load, then settles at a value that is 10x the original value at rest, forever, even when the traffic is literally 0 for half a day. By definition that memory cannot be reclaimed under pressure (the working set excludes easily evictable page cache), so something is holding on to it.

antoineco avatar Dec 09 '20 16:12 antoineco

My concern was about the value of the WSS, which starts at < 4 MiB, increases under load, then settles at a value that is 10x the original value at rest, forever, even when the traffic is literally 0 for half a day.

OK, I see it. I think a good counter-test to understand whether this is GCP behaviour would be to check on another machine, say a bare-metal machine with CentOS/Debian/SUSE or whatever other distro you want. Trying a different kernel might help us understand whether this is a behaviour of the underlying GCP kernel/memory allocator, or something else related to sdk-go.

slinkydeveloper avatar Dec 09 '20 16:12 slinkydeveloper
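
A complementary counter-test on the Go side would be to force the runtime to return unused heap to the OS and watch whether container_memory_working_set_bytes drops. A minimal sketch, assuming a SIGUSR1 trigger (the trigger is illustrative; a debug HTTP endpoint would work just as well):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/debug"
	"syscall"
)

func main() {
	// On SIGUSR1, run a GC and return as much memory to the OS as possible.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		for range sig {
			debug.FreeOSMemory()
			log.Println("debug.FreeOSMemory() called")
		}
	}()

	// ... start the CloudEvents receiver here ...
	select {}
}
```

If the working set drops after the call, the plateau is memory the Go runtime was holding on to rather than something leaked by the SDK. Note that on Linux the scavenger may release memory with MADV_FREE, in which case the kernel reclaims the pages lazily and the metric only falls under memory pressure.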