opentelemetry-collector-contrib
splunkhec receiver memory leak
Component(s)
receiver/splunkhec
What happened?
Description
We've uncovered what appears to be a memory leak in the splunkhec receiver that surfaced around version 0.102.0. We run the receiver in both a metrics and a logs pipeline. We are not sending logs over HEC, yet the collector is holding 2 GB of heap under (*ObsReport).StartLogsOp (obsreport.go:L#95).
Steps to Reproduce
Configure the splunkhec receiver in both a logs and a metrics pipeline, and send only metrics to it.
Expected Result
Memory remains stable
Actual Result
Memory grows until the memory limiter triggers, at which point the collector runs a GC.
Collector version
0.106.1
Environment information
Environment
OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
When we remove the splunkhec receiver from the logs pipeline, memory is stable. At the moment, we only have applications sending metrics to the HEC receiver in the impacted clusters.
Pinging code owners:
- receiver/splunkhec: @atoulme
See Adding Labels via Comments if you do not have permissions to add labels yourself.
+1
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.106.1/receiver/splunkhecreceiver/receiver.go#L516-L528
I'm suspicious of this section of code. r.obsrecv.EndLogsOp is only called when the request contains log events.
	if r.logsConsumer != nil && len(events) > 0 {
		ld, err := splunkHecToLogData(r.settings.Logger, events, resourceCustomizer, r.config)
		if err != nil {
			r.failRequest(ctx, resp, http.StatusBadRequest, errUnmarshalBodyRespBody, len(events), err)
			return
		}
		decodeErr := r.logsConsumer.ConsumeLogs(ctx, ld)
		r.obsrecv.EndLogsOp(ctx, metadata.Type.String(), len(events), decodeErr)
		if decodeErr != nil {
			r.failRequest(ctx, resp, http.StatusInternalServerError, errInternalServerError, len(events), decodeErr)
			return
		}
	}
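To make the suspected imbalance concrete, here is a minimal, self-contained sketch. It does not use the real obsreport API: `opTracker`, `handleLeaky`, `handleBalanced`, and `openAfter` are hypothetical names, and the sketch assumes the operation is started for every request once a logs consumer is configured. It is not taken from the receiver or from any fix; it only shows how a Start without a matching End accumulates when requests carry no log events.

```go
package main

import "fmt"

// opTracker is a hypothetical stand-in for the receiver's obsreport helper.
// It only counts operations that were started but never ended.
type opTracker struct {
	open int
}

func (t *opTracker) StartLogsOp() { t.open++ }
func (t *opTracker) EndLogsOp()   { t.open-- }

// handleLeaky mirrors the shape of the snippet above: the operation is
// started for every request but ended only when log events are present.
func handleLeaky(t *opTracker, logEvents int) {
	t.StartLogsOp()
	if logEvents > 0 {
		// ... convert and consume the log events here ...
		t.EndLogsOp()
	}
}

// handleBalanced ends every started operation, whatever the payload held.
func handleBalanced(t *opTracker, logEvents int) {
	t.StartLogsOp()
	defer t.EndLogsOp()
	if logEvents == 0 {
		return // nothing to consume, but the deferred End still runs
	}
	// ... convert and consume the log events here ...
}

// openAfter sends n requests with zero log events (the metrics-only case)
// through handle and returns how many operations are still open.
func openAfter(handle func(*opTracker, int), n int) int {
	t := &opTracker{}
	for i := 0; i < n; i++ {
		handle(t, 0)
	}
	return t.open
}

func main() {
	fmt.Println("leaky, open ops:", openAfter(handleLeaky, 1000))
	fmt.Println("balanced, open ops:", openAfter(handleBalanced, 1000))
}
```

In the real receiver each open operation would hold whatever StartLogsOp allocated, which would be consistent with the heap profile in the report, though that link is an inference rather than something confirmed here.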
Thanks for the report. The code is pure spaghetti, sorry. I have a fix out for review: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34911
Ideally we should have tests that look for this type of leak. It might be good for us to have generated tests for all components that do that down the road.
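As a rough idea of what such a shared check could look like, here is a sketch in plain Go. Everything in it is hypothetical: the `component` interface, the two toy implementations, and `findLeaks` are illustrations only, not existing collector test helpers. The assumption is that each component can report how many observability operations it still has open after handling a mix of empty and non-empty requests.

```go
package main

import "fmt"

// component is a hypothetical interface for anything that handles requests
// and can report how many started operations have not been ended yet.
type component interface {
	Name() string
	Handle(logEvents int)
	OpenOps() int
}

// leaky ends its operation only when the request carried log events.
type leaky struct{ open int }

func (c *leaky) Name() string { return "leaky" }
func (c *leaky) Handle(logEvents int) {
	c.open++
	if logEvents > 0 {
		c.open--
	}
}
func (c *leaky) OpenOps() int { return c.open }

// balanced always pairs its start with an end.
type balanced struct{ open int }

func (c *balanced) Name() string { return "balanced" }
func (c *balanced) Handle(logEvents int) {
	c.open++
	defer func() { c.open-- }()
}
func (c *balanced) OpenOps() int { return c.open }

// findLeaks drives every component with alternating empty and non-empty
// requests and returns the names of those left with open operations.
func findLeaks(components []component, requests int) []string {
	var leaks []string
	for _, c := range components {
		for i := 0; i < requests; i++ {
			c.Handle(i % 2) // 0 log events on even i, 1 on odd i
		}
		if c.OpenOps() != 0 {
			leaks = append(leaks, c.Name())
		}
	}
	return leaks
}

func main() {
	fmt.Println(findLeaks([]component{&leaky{}, &balanced{}}, 10))
}
```

The point of including empty requests in the workload is that a check driven only by non-empty payloads would not have caught the case reported in this issue.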