opentelemetry-go

Significant Memory Increase with OpenTelemetry Leading to OOMKilled Issues on Kubernetes

Open phthaocse opened this issue 1 year ago • 6 comments

Hello,

Our company is currently using the latest version of OpenTelemetry Go (1.27.0). After implementing OpenTelemetry to record metrics, we noticed a significant increase in memory usage in our pods deployed on Kubernetes, leading to OOMKilled issues. Could you please provide us with any documentation or knowledge regarding how OpenTelemetry manages memory?

Thank you.

phthaocse avatar Jun 01 '24 16:06 phthaocse

This ask is rather vague. OpenTelemetry does not "manage memory" per se; Go manages memory.

We do have benchmarks that track allocations though. They run on new releases, and manually on an as-needed basis in PRs.
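
For illustration, a hypothetical benchmark in the style of those allocation checks might look like the sketch below (the name BenchmarkSpanAttributes is made up, not one of the repo's actual benchmarks):

package trace_test

import (
	"context"
	"testing"

	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Run with: go test -bench=BenchmarkSpanAttributes -benchmem
func BenchmarkSpanAttributes(b *testing.B) {
	tracer := sdktrace.NewTracerProvider().Tracer("bench")

	b.ReportAllocs() // report allocs/op and B/op alongside ns/op
	for i := 0; i < b.N; i++ {
		_, span := tracer.Start(context.Background(), "op")
		span.SetAttributes(attribute.String("key", "value"))
		span.End()
	}
}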

Investigating this would require looking into what exactly is using memory within your application. That may be due to otel (like anything, it does have a memory and CPU footprint). It could also be that your resources were already stretched too thin. Without more information, I'm afraid there isn't much more we can do here.
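
If it helps, here is a minimal sketch (standard library net/http/pprof only, arbitrary localhost port, not an otel API) for exposing the runtime heap profile so you can see what is actually holding memory:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Inspect live heap usage with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Println(http.ListenAndServe("localhost:6060", nil))
}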

dmathieu avatar Jun 18 '24 09:06 dmathieu

Could you please provide us with any documentation or knowledge regarding how OpenTelemetry manages memory?

I think that would be overkill. You can always read the codebase.

After implementing OpenTelemetry to record metrics, we noticed a significant increase in memory usage in our pods deployed on Kubernetes, leading to OOMKilled issues.

We cannot do anything without repro steps or profiling data.
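
For example, a heap profile captured with something like the sketch below (hypothetical helper name dumpHeapProfile, standard runtime/pprof only, nothing otel-specific) and attached to the issue would give us something to look at:

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeapProfile writes a heap snapshot that can be opened with
// `go tool pprof heap.pprof`.
func dumpHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // collect first so the snapshot reflects live objects
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := dumpHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}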

pellared avatar Jun 18 '24 11:06 pellared

There's definitely a problem with memory allocations/usage in 1.27. Since I upgraded from 1.24 to 1.27, my service uses more memory. This is from pprof; I hope it can help: [pprof screenshot]

yaniv-s avatar Jun 25 '24 14:06 yaniv-s

Please provide the example code that you used to generate that graphic. I am not aware of a function in this project called AddTagToContext. It looks like an inlined slice grow is happening there. Understanding that call site is needed to begin addressing this.

MrAlias avatar Jun 25 '24 21:06 MrAlias

Has anyone found a solution for this issue?

kellis5137 avatar Aug 01 '24 18:08 kellis5137

Just in case someone runs into this problem: I'm not 100% sure of the exact cause, but the autoinstrumentation Go sidecar's resource limit defaults to 32Mi of memory, and I think it needs to be bumped. I upped it and it worked. It took me a while to figure out HOW to bump the sidecar. In your Instrumentation manifest, add a go section under the spec object:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  go:
    resourceRequirements:
      limits:
        cpu: <up the value if necessary>
        memory: <up the value if necessary> # I upped it to 512Mi (normally 32Mi). Going to monitor and see if I can go down.
      requests:
        cpu: 5m # this is the original value as of this writing
        memory: 62Mi # I doubled the default (normally 32Mi)
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318

kellis5137 avatar Aug 02 '24 15:08 kellis5137

I think this ticket can be closed. I was able to trace this back to this change:

// ensureAttributesCapacity inlines functionality from slices.Grow
// so that we can avoid needing to import golang.org/x/exp for go1.20.
// Once support for go1.20 is dropped, we can use slices.Grow available since go1.21 instead.
// Tracking issue: https://github.com/open-telemetry/opentelemetry-go/issues/4819.
func (s *recordingSpan) ensureAttributesCapacity(minCapacity int) {
	if n := minCapacity - cap(s.attributes); n > 0 {
		s.attributes = append(s.attributes[:cap(s.attributes)], make([]attribute.KeyValue, n)...)[:len(s.attributes)]
	}
}

It used to ensure the slice's capacity was at least n, but when it changed to slices.Grow -

// Grow increases the slice's capacity, if necessary, to guarantee space for
// another n elements. After Grow(n), at least n elements can be appended
// to the slice without another allocation. If n is negative or too large to
// allocate the memory, Grow panics.
func Grow[S ~[]E, E any](s S, n int) S {
	if n < 0 {
		panic("cannot be negative")
	}
	if n -= cap(s) - len(s); n > 0 {
		s = append(s[:cap(s)], make([]E, n)...)[:len(s)]
	}
	return s
}

It started ensuring there was at least n more available capacity beyond the current length, so capacity grows on calls that previously were a no-op. This bug was introduced in https://github.com/open-telemetry/opentelemetry-go/commit/561714acb23c896ddd2ca0b5efa45b183f55cdb7 but was fortunately fixed recently in https://github.com/open-telemetry/opentelemetry-go/commit/3cbd9671528117454519809f9292fb264415cf38
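
To make the difference concrete, here is a small standalone sketch (a hypothetical slice of ints with made-up length and capacity, not SDK code; requires Go 1.21+ for the slices package):

package main

import (
	"fmt"
	"slices"
)

func main() {
	attrs := make([]int, 3, 4) // len 3, cap 4

	// Old semantics: ensure cap(attrs) >= minCapacity (4). Already true, so no growth.
	if n := 4 - cap(attrs); n > 0 {
		attrs = append(attrs[:cap(attrs)], make([]int, n)...)[:len(attrs)]
	}
	fmt.Println(cap(attrs)) // 4

	// slices.Grow semantics: guarantee room for 4 MORE elements beyond len,
	// so capacity grows to at least len+4 = 7, even though cap was already "enough".
	attrs = slices.Grow(attrs, 4)
	fmt.Println(cap(attrs)) // >= 7
}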

So v1.31.0 users should not be experiencing this.

AdallomRoy avatar Oct 14 '24 10:10 AdallomRoy

Closing per previous comment.

pellared avatar Oct 15 '24 10:10 pellared