
Consider removing `memory_ballast` extension

Open mx-psi opened this issue 2 years ago • 9 comments

The memory_ballast extension allows defining a 'heap ballast' to make the behavior of Go's garbage collector more predictable. Quoting this post,

the ballast increases the base size of the heap so that our GC triggers are delayed and the number of GC cycles over time is reduced.

This functionality was added in #45, but a few years and Go versions have passed since, and there has been discussion (e.g. see https://github.com/open-telemetry/opentelemetry-collector/issues/7512#issuecomment-1499605455 ) on the need for this extension today. In particular, tweaking GOMEMLIMIT and GOGC may be enough to tackle the issues the extension was meant to address. It also seems the extension may be risky in some cases and can increase memory usage (this may be the cause of #7512).

The most recent issue on the matter upstream is golang/go/issues/42430.

This issue is to decide whether we should remove the memory ballast extension or not.

cc @Aneurysm9 @dmitryax

mx-psi avatar Aug 31 '23 15:08 mx-psi

@mx-psi thank you for summarizing the issue.

I don't see any good reason to keep this extension. The GOMEMLIMIT env var is much more intuitive to use. The only configuration a user could potentially miss is size_in_percentage, but I doubt it's something we should keep this extension for.

I'd suggest we just go ahead and deprecate it with clear instructions on how to utilize GOMEMLIMIT and GOGC instead. We can link this issue in the deprecation warning for users to comment if they have any concerns.

Let's discuss it in the next SIG meeting.

dmitryax avatar Sep 05 '23 01:09 dmitryax

One starting point to ensure we can do this correctly is to replace the memory ballast extension by an equivalent mechanism using GOMEMLIMIT and GOGC in the Collector Helm chart and ensure that it works correctly in that setting. We can use that experience to aid us in writing the documentation.

mx-psi avatar Sep 05 '23 09:09 mx-psi

From Atlassian's experience, we've not needed the ballast extension for a deployment that strictly observes the host (it doesn't expose any endpoints for services to send data to, and Prometheus is used to monitor the collector itself).

On average, our deployments will use:

  • 50 MB of memory, with a hard limit of 300 MB (set by Docker, i.e. cgroups)
  • 3% CPU utilisation, with a hard limit of one full core.

What is being monitored by the collector changes from service to service, but roughly 4 to 10 receivers are running and capturing data (some are internal to us, so it didn't make sense to list them).

We have an internal service that works much like the collector, where we had experimented with the GOGC value, and we noted that controlling it was a trade-off. One side was high CPU usage with no significant reduction in memory usage: the service spent more time performing GC work than making progress on its business functions. The alternative was high memory usage, which eventually led to OOMs. However, the Go version used there was 1.17, and things appear to have improved since, from what I am reading.

To get back on topic, I think the risks of continued maintenance of the ballast extension are:

  • It is potentially hiding memory leaks in the collector by cleaning up faster
    • I had seen this when components used pointers to pdata values, which meant values were escaping to the heap when the stack would have been adequate.
  • I think we should leverage the language runtime more here
    • There will be more public documentation / articles that we can reference with understanding how to tune the collector's heap usage.
  • Manually calling the GC via runtime.GC is a hazard: it causes more frequent stop-the-world pauses, which can make the performance problems worse.
    • Consider the case where the ballast calls runtime.GC because the memory limit was exceeded, right after the runtime itself has performed a GC operation. Since the call blocks the entire application, no connections will be served until it finishes, and if the collector is running in a scenario where it needs the extension, it will more than likely experience a brief, unexplainable outage.

I believe it has served us well, but I see keeping it around as the bigger risk.

MovieStoreGuy avatar Sep 06 '23 07:09 MovieStoreGuy

Here is a user report for GOMEMLIMIT as a replacement for a memory ballast.

An alternative to outright removal would be to modify the current implementation to use debug.SetMemoryLimit. However, the relationship between a ballast size and a memory limit value is not clear to me, and it could be confusing to have a memory_ballast extension that does not in fact use a memory ballast.

mx-psi avatar Sep 06 '23 08:09 mx-psi

Filed open-telemetry/opentelemetry-helm-charts/issues/891 to do https://github.com/open-telemetry/opentelemetry-collector/issues/8343#issuecomment-1706261604 as decided on the last Collector SIG

mx-psi avatar Sep 13 '23 11:09 mx-psi

Are there any plans on implementing GOMEMLIMIT for the operator/operator-chart as well?

hanswurscht avatar Oct 06 '23 10:10 hanswurscht

@hanswurscht I think we can do something similar to the Helm chart approach, it will further help validate that GOMEMLIMIT is a valid alternative. Could you file an issue on the operator repository and link it to this one?

mx-psi avatar Oct 06 '23 11:10 mx-psi

Based on the discussion on https://github.com/open-telemetry/opentelemetry-helm-charts/issues/891 and this issue, I filed #8803 to deprecate the memory ballast extension.

mx-psi avatar Nov 06 '23 17:11 mx-psi

It is potentially hiding memory leaks in the collector by cleaning up faster

This is a misunderstanding: Go heap profiles only show what survived the last garbage collection.

(Also, a heap ballast serves to slow down garbage collection.)

bboreham avatar Dec 04 '23 10:12 bboreham

I have an anecdote for you.

We had been struggling with trying to tune our gateway collectors for a few weeks after we started experimenting with connectors to generate metrics from our traces. Though we did identify some memory leaks, under load the collector would spike its memory usage and eventually cause the memory_limiter to start rejecting spans and metrics.

It would then get stuck in GC loops until our readiness probes failed and k8s would lay them to rest.

After replacing memory_ballast+GOMEMLIMIT with only GOMEMLIMIT, we saw a drastic reduction in the collectors' memory usage and increased throughput:

[image: graph of collector memory usage before and after the change]

arielvalentin avatar Feb 16 '24 15:02 arielvalentin