opentelemetry-collector-contrib

splunkhec receiver memory leak

Open clheikes opened this issue 1 year ago • 1 comments

Component(s)

receiver/splunkhec

What happened?

Description

We've uncovered what appears to be a memory leak in the splunkhec receiver that surfaced around version 0.102.0. We run the receiver in both a metrics and a logs pipeline. We are not sending logs over HEC, yet the collector is holding 2 GB of heap under (*ObsReport).StartLogsOp (obsreport.go:L#95).

Steps to Reproduce

splunkhec receiver in logs and metrics pipeline

Expected Result

Memory remains stable

Actual Result

Memory grows until the memory limiter is triggered, and then the collector runs a GC.

Collector version

0.106.1

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

When we remove the splunkhecreceiver from the logs pipeline, memory is stable. At the moment we only have applications sending metrics to the HEC receiver in the impacted clusters.

clheikes, Aug 27 '24 19:08

Pinging code owners:

  • receiver/splunkhec: @atoulme

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot], Aug 27 '24 19:08

+1

brettplarson, Aug 28 '24 14:08

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.106.1/receiver/splunkhecreceiver/receiver.go#L516-L528

I'm suspicious of this section of code. The r.obsrecv.EndLogsOp call only happens if there are log events:

	if r.logsConsumer != nil && len(events) > 0 {
		ld, err := splunkHecToLogData(r.settings.Logger, events, resourceCustomizer, r.config)
		if err != nil {
			r.failRequest(ctx, resp, http.StatusBadRequest, errUnmarshalBodyRespBody, len(events), err)
			return
		}
		decodeErr := r.logsConsumer.ConsumeLogs(ctx, ld)
		r.obsrecv.EndLogsOp(ctx, metadata.Type.String(), len(events), decodeErr)
		if decodeErr != nil {
			r.failRequest(ctx, resp, http.StatusInternalServerError, errInternalServerError, len(events), decodeErr)
			return
		}
	}

clheikes, Aug 28 '24 16:08

Thanks for the report, the code is pure spaghetti, sorry. I have a fix out for review: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/34911

Ideally we should have tests that look for this type of leak. Down the road, it might be good for us to have generated tests that do this for all components.
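
One rough way such a check could look, using only the Go standard library: run an operation many times, force a GC on both sides, and compare the live heap. This is a sketch under assumptions, not an existing collector test helper. `heapGrowth`, `leakyOp`, `cleanOp`, and the 1 MiB limit are all hypothetical, and heap measurements are noisy, so a real test would need a generous threshold.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapGrowth runs fn n times and returns the change in live heap bytes,
// measured after a forced GC before and after the loop.
func heapGrowth(n int, fn func()) int64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		fn()
	}
	runtime.GC()
	runtime.ReadMemStats(&after)
	return int64(after.HeapAlloc) - int64(before.HeapAlloc)
}

// retained simulates state that is never released.
var retained [][]byte

// leakyOp keeps a reference to every buffer it allocates.
func leakyOp() { retained = append(retained, make([]byte, 1024)) }

// cleanOp allocates a buffer and drops it.
func cleanOp() {
	buf := make([]byte, 1024)
	_ = buf
}

func main() {
	const n, limit = 10000, 1 << 20 // arbitrary iteration count and 1 MiB limit
	fmt.Println("leaky exceeds limit:", heapGrowth(n, leakyOp) > limit)
	fmt.Println("clean exceeds limit:", heapGrowth(n, cleanOp) > limit)
}
```

A generated per-component test would replace `leakyOp` and `cleanOp` with calls into the component under test, for example sending requests that carry no events.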

atoulme, Aug 28 '24 17:08