opentelemetry-helm-charts icon indicating copy to clipboard operation
opentelemetry-helm-charts copied to clipboard

[opentelemetry-collector] Remove `memory_ballast` extension from default template and replace by `GOMEMLIMIT`

Open mx-psi opened this issue 11 months ago • 27 comments

We want to remove the memory_ballast extension in favor of Go 1.19+ GOMEMLIMIT environment variable as stated in open-telemetry/opentelemetry-collector/issues/8343.

The first step to verify that we can effectively remove the extension is to change this on the OpenTelemetry Collector Helm chart. We can replace the extension by a value of GOMEMLIMIT that is something like 80-90% of the total memory limit for the Collector container. Once we release this change and wait some time to gather feedback, we can move on with deprecation of the memory ballast extension.

mx-psi avatar Sep 13 '23 11:09 mx-psi

cc @dmitryax @TylerHelmuth

mx-psi avatar Sep 13 '23 11:09 mx-psi

@mx-psi @dmitryax i think we should implement this via a feature-gate-like config.

The first PR can add the option to use the Go setting instead of the extension, but disabled by default. After some time, we can enable it by default, but still allow users who want to use the extension to disable it. Then we'd wait until the extension is deprecated to remove the option.

TylerHelmuth avatar Sep 13 '23 13:09 TylerHelmuth

i think we should implement this via a feature-gate-like config.

Have we done anything similar before on the Collector Helm chart? The plan sounds reasonable to me

mx-psi avatar Sep 14 '23 10:09 mx-psi

@TylerHelmuth should this be a usual configuration option or is there an alternative mechanism in the Helm chart for temporary toggles?

mx-psi avatar Sep 22 '23 10:09 mx-psi

We dont have anything like feature gates in the chart, we normally do it via a configuration option (or 2).

TylerHelmuth avatar Sep 22 '23 15:09 TylerHelmuth

Question: what are we planning to do we do when request limits are not set?

ATM we default memorybalast to (https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/_config.tpl#L25C1-L27C1) :

      memory_ballast:
        size_in_percentage: 40

and requests / limits to empty:

resources: {}
# resources:
#   limits:
#     cpu: 250m
#     memory: 512Mi

I think the only thing we can do is not set the GOMEMLIMIT env?

This would be a different behaviour from current one.

povilasv avatar Sep 26 '23 08:09 povilasv

I think the only thing we can do is not set the GOMEMLIMIT env?

I am personally fine with that, if you explicitly unset resource limits then you should be on your own to configure this

mx-psi avatar Sep 26 '23 08:09 mx-psi

I have started work on this

TylerHelmuth avatar Sep 27 '23 20:09 TylerHelmuth

How much should we wait until we enable the GOMEMLIMIT option by default? Is something like one month enough?

mx-psi avatar Oct 09 '23 12:10 mx-psi

@mx-psi is this helm chart feature critical for the decision on whether or not to deprecate the extension or has the decision already been made to deprecate it?

TylerHelmuth avatar Oct 09 '23 17:10 TylerHelmuth

@mx-psi is this helm chart feature critical for the decision on whether or not to deprecate the extension or has the decision already been made to deprecate it?

I would want to validate the approach with the Helm chart before committing to deprecate the extension. I don't feel like I personally have enough knowledge about the Go garbage collector to ensure this replacement will be satisfactory for all use cases (it probably will, that's why I pushed for this, but I want to be sure before deprecating the extension).

If anybody else feels like this is clear cut, then we don't have to wait for this, but personally I feel like this is safer.

mx-psi avatar Oct 09 '23 19:10 mx-psi

In that case I think we need to leave it around as long as it take to make a decision. We won't move forward with turning it always on until we're sure we are deprecating the extension.

I can enable it on our internal use of the collector, but we'll need other examples as well.

TylerHelmuth avatar Oct 10 '23 14:10 TylerHelmuth

@JaredTan95 based on #912 it seems like you are testing this out. Would you mind sharing any findings about the setting (memory and CPU usage patterns before/after?)

I also posted a call out for more testing on a few channels on the CNCF Slack. Once we have at least two I think we can proceed with the deprecation and enable this by default.

mx-psi avatar Oct 11 '23 12:10 mx-psi

We have been running Collectors using the Helm chart with useGOMEMLIMIT enabled for a few weeks on Datadog's infrastructure with a similar CPU usage as with the memory ballast and lower memory consumption (because of the memory ballast).

mx-psi avatar Nov 03 '23 10:11 mx-psi

Can you share some stats/numbers on the improvements?

krisztianfekete avatar Nov 03 '23 14:11 krisztianfekete

We see similar stats at Honeycomb

TylerHelmuth avatar Nov 03 '23 16:11 TylerHelmuth

@JaredTan95 based on #912 it seems like you are testing this out. Would you mind sharing any findings about the setting (memory and CPU usage patterns before/after?)

I also posted a call out for more testing on a few channels on the CNCF Slack. Once we have at least two I think we can proceed with the deprecation and enable this by default.

We switched to this(useGOMEMLIMIT) configuration because of an issue with memory ballast's compatibility with our arm architecture. We haven't verified this configuration in production, but in our test environment, this feature solves the arm compatibility problem and the effect is similar to that of the original memory ballast.

So, I support it.

JaredTan95 avatar Nov 04 '23 16:11 JaredTan95

@mx-psi I'd still like to see the extension officially deprecated in Core before we make this a default value in the helm chart.

TylerHelmuth avatar Nov 05 '23 22:11 TylerHelmuth

@JaredTan95 can you please share more details about memory ballast's compatibility issues with our arm architecture? It be nice to post them in https://github.com/open-telemetry/opentelemetry-collector/issues/8343 as well

dmitryax avatar Nov 06 '23 02:11 dmitryax

Based on the discussion on https://github.com/open-telemetry/opentelemetry-collector/issues/8343 and this issue, I filed open-telemetry/opentelemetry-collector#8803 to deprecate the memory ballast extension.

mx-psi avatar Nov 06 '23 17:11 mx-psi

I'd prefer the next step before deprecating the memory_ballast extension to be switching the useGOMEMLIMIT feature gate to true in the helm chart. That way we are not forced to do it and will have a chance to get some more feedback before the actual deprecation. I'd think we should keep useGOMEMLIMIT: true for a couple of versions, then deprecate memory_ballast extension. @mx-psi @JaredTan95 @povilasv @TylerHelmuth WDYT?

dmitryax avatar Dec 04 '23 17:12 dmitryax

I am fine with that, although I think @TylerHelmuth wanted to go the other way around :smile: Up to the Helm chart maintainers I guess :)

mx-psi avatar Dec 04 '23 17:12 mx-psi

Ya I was imagining after the initial testing that the helm chart would follow the guidance from Core. My though being that I didn't want to disrupt users until the component was actually deprecated.

But we could enable it by default and use helm chart users as testers (unless they switch it back to false).

TylerHelmuth avatar Dec 04 '23 17:12 TylerHelmuth

But we could enable it by default and use helm chart users as testers (unless they switch it back to false).

That's what I thought. I wanted to gather as much feedback as possible before the deprecation. I'm thinking of useGOMEMLIMIT as an improvement anyway

dmitryax avatar Dec 04 '23 17:12 dmitryax

I'd think we should keep useGOMEMLIMIT: true for a couple of versions, then deprecate memory_ballast extension

I agree! 👏

JaredTan95 avatar Dec 06 '23 01:12 JaredTan95

@JaredTan95 can you please share more details about memory ballast's compatibility issues with our arm architecture? It be nice to post them in open-telemetry/opentelemetry-collector#8343 as well

The compatibility problem we encountered was only tested on the Chinese operating system(Kylin OS), so I think it has little reference value for users using international operating systems, so I did not put it up :-P

JaredTan95 avatar Dec 06 '23 01:12 JaredTan95

open-telemetry/opentelemetry-collector/pull/8803 got merged, we can revert before Jan 8th if we find issues

mx-psi avatar Dec 20 '23 12:12 mx-psi