opentelemetry-go
Significant Memory Increase with OpenTelemetry Leading to OOMKilled Issues on Kubernetes
Hello,
Our company is currently using the latest version of OpenTelemetry Go, 1.27.0. After implementing OpenTelemetry to record metrics, we noticed a significant increase in memory usage in our pods deployed on Kubernetes, leading to OOMKilled issues. Could you please provide us with any documentation or knowledge regarding how OpenTelemetry manages memory?
Thank you.
This ask is rather vague. OpenTelemetry does not "manage memory" per se; Go manages memory.
We do have benchmarks that track allocations, though. They run on new releases, and manually on an as-needed basis in PRs.
Investigating this would require looking into what exactly is using memory within your application. That may be due to otel (like anything, it does have a memory and CPU footprint). It could also be that your resource limits are simply set too low. Without more information, I'm afraid there isn't much more we can do here.
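For context, allocation tracking in Go is just the standard testing package. Below is a minimal sketch, not the project's actual benchmark suite (the benchmark name and attribute keys are made up), showing how span attribute recording can be measured with b.ReportAllocs and run via go test -bench=. -benchmem:
package otelbench

import (
    "context"
    "testing"

    "go.opentelemetry.io/otel/attribute"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// BenchmarkSpanAttributes reports allocations per span-with-attributes operation.
func BenchmarkSpanAttributes(b *testing.B) {
    tp := sdktrace.NewTracerProvider()
    b.Cleanup(func() { _ = tp.Shutdown(context.Background()) })
    tracer := tp.Tracer("bench")

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, span := tracer.Start(context.Background(), "operation")
        span.SetAttributes(
            attribute.String("http.method", "GET"),
            attribute.Int("iteration", i),
        )
        span.End()
    }
}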
Could you please provide us with any documentation or knowledge regarding how OpenTelemetry manages memory?
I think that would be overkill. You can always read the codebase.
After implementing OpenTelemetry to record metrics, we noticed a significant increase in memory usage in our pods deployed on Kubernetes, leading to OoMKilled issues.
We cannot do anything without repro steps or profiling data.
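For anyone who wants to capture that data, a heap profile from the standard library's net/http/pprof is usually enough. A minimal sketch (the port is arbitrary):
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose profiling endpoints; grab a heap profile with:
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of the application ...
    select {}
}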
There's definitely a problem with memory allocations/usage in 1.27.
Since I upgraded from 1.24 to 1.27, my service uses more memory. This is from pprof; I hope it can help.
Please provide the example code that you used to generate that graphic. I am not aware of a function in this project called AddTagToContext. It looks like an inlined grow is happening there. Understanding that call site is needed to begin addressing this.
Has anyone found a solution for this issue?
Just in case someone runs into this problem: I'm not 100% sure of the exact cause, but the resource limits cap the sidecar's memory at 32 MiB. I think it needs to be bumped. I raised it and it worked. It took me a while to figure out HOW to bump the autoinstrumentation Go sidecar. In your Instrumentation manifest, add a go section under the spec object:
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  go:
    resourceRequirements:
      limits:
        cpu: <up the value if necessary>
        memory: <up the value if necessary> # I upped it to 512Mi (normally 32Mi). Going to monitor and see if I can go down
      requests:
        cpu: 5m # this is the original value as of this writing
        memory: 62Mi # I doubled the amount for the default (normally 32Mi)
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector:4318
I think this ticket can be closed. I was able to trace this back to this change:
// ensureAttributesCapacity inlines functionality from slices.Grow
// so that we can avoid needing to import golang.org/x/exp for go1.20.
// Once support for go1.20 is dropped, we can use slices.Grow available since go1.21 instead.
// Tracking issue: https://github.com/open-telemetry/opentelemetry-go/issues/4819.
func (s *recordingSpan) ensureAttributesCapacity(minCapacity int) {
    if n := minCapacity - cap(s.attributes); n > 0 {
        s.attributes = append(s.attributes[:cap(s.attributes)], make([]attribute.KeyValue, n)...)[:len(s.attributes)]
    }
}
It used to ensure the slice's total capacity was at least n, but then it was changed to slices.Grow:
// Grow increases the slice's capacity, if necessary, to guarantee space for
// another n elements. After Grow(n), at least n elements can be appended
// to the slice without another allocation. If n is negative or too large to
// allocate the memory, Grow panics.
func Grow[S ~[]E, E any](s S, n int) S {
    if n < 0 {
        panic("cannot be negative")
    }
    if n -= cap(s) - len(s); n > 0 {
        s = append(s[:cap(s)], make([]E, n)...)[:len(s)]
    }
    return s
}
It started ensuring there was at least n more available capacity beyond the current length. This bug was introduced here: https://github.com/open-telemetry/opentelemetry-go/commit/561714acb23c896ddd2ca0b5efa45b183f55cdb7 but was fortunately fixed recently here: https://github.com/open-telemetry/opentelemetry-go/commit/3cbd9671528117454519809f9292fb264415cf38
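To see the difference outside the SDK, here is a standalone sketch using a plain []int instead of []attribute.KeyValue (illustrative only, not the SDK code): the old helper only grows until the total capacity reaches the requested value, while slices.Grow reserves room for that many additional elements on top of the current length.
package main

import (
    "fmt"
    "slices"
)

// ensureCapacityTotal mirrors the original helper's behavior: grow only
// until the slice's total capacity is at least minCapacity.
func ensureCapacityTotal(s []int, minCapacity int) []int {
    if n := minCapacity - cap(s); n > 0 {
        s = append(s[:cap(s)], make([]int, n)...)[:len(s)]
    }
    return s
}

func main() {
    const limit = 128 // e.g. an attribute count limit

    a := make([]int, 100)
    a = ensureCapacityTotal(a, limit)
    fmt.Println(cap(a)) // at least 128 in total

    b := make([]int, 100)
    b = slices.Grow(b, limit)
    fmt.Println(cap(b)) // at least 228: room for 128 *more* elements
}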
So v1.31.0 users should not be experiencing this.
Closing per previous comment.