opentelemetry-collector [exporterhelper] fix panic introduced by nil pointer in send

Description

There is no guarantee that after the exporterhelper sends the plog data downstream that the data won't be mutated in some way. (e.g by the batch_sender) This mutation could result in the proceeding call to req.ItemsCount() to panic due to nil pointer (by the time send returns, the resourcelogslice within the request might no longer be valid).

Link to tracking issue

Fixes # https://github.com/open-telemetry/opentelemetry-collector/issues/10033

Testing

Existing test cases should cover this code change.

Documentation

Jun 13 '24 17:06 zpzhuSplunk

Hey @dmitryax and @atoulme. Just to give a bit more context. Our team was able to reproduce the problem only when we send large amount of data(500GB per day in an instance) through. So the theory in the description fits as this can only happen in very rare race condition. Can you take a look and see if this change is ok?

Jun 14 '24 16:06 splunkericl

The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

Jun 14 '24 16:06 atoulme

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.24%. Comparing base (49ea32b) to head (862d293). Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #10402   +/-   ##
=======================================
  Coverage   92.24%   92.24%           
=======================================
  Files         403      403           
  Lines       18720    18723    +3     
=======================================
+ Hits        17268    17271    +3     
  Misses       1097     1097           
  Partials      355      355

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Jun 14 '24 17:06 codecov[bot]

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Jul 03 '24 03:07 github-actions[bot]

We are facing the same issue with a custom receiver.

...
logs, err := receiver.parser.Parse(body)
if err != nil {
	writeError(w, err, http.StatusBadRequest)
	return
}

// At this point, the receiver has accepted the payload
ctx := receiver.obsreport.StartLogsOp(req.Context())
err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

if err != nil {
	writeError(w, err, http.StatusInternalServerError)
	return
}
...

Panic happens here:

receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)

Can we investigate this? Or maybe mention somewhere in the documentation this will happen?

Jul 16 '24 16:07 therealpandey

@grandwizard28 that seems like a different issue, because it happens outside of exporterhelper. Please open a new issue, post version and pipeline. You can redact your custom receiver and just post the snippet of code you reference here.

Jul 16 '24 16:07 atoulme

The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.

@atoulme I think we can merge this PR as it is, since, the race condition would happen when the request is completed and then the next line is called. Catching this in a unit test would be tricky...

	err := lewo.nextSender.send(c, req)
	lewo.obsrep.EndLogsOp(c, req.ItemsCount(), err)

WDYT?

Jul 16 '24 17:07 shivanshuraj1333

If we do this:

Having a unit test and making the clear expectation that it is not safe to call ItemsCount on any logs/metrics/traces after consumption needs to be spelled out
we should do this across all signals, not just logs.

Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.

Jul 16 '24 19:07 atoulme

@atoulme yeah it is understandable to have a unit test to re-create the failures. Our team looked into it for a while but can't really reproduce the problem with unit test. Even on production it is only reproducible under specific scenario(specific pipeline setup and throughput).

We changed our internal pipeline so it doesn't happen again in production for now. We tried and wasn't able to re-create the failure in a unit test. If anyone else has any suggestions please feel free to post them.

Jul 16 '24 20:07 splunkericl

Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.

I'm talking about the changes in this PR. In the unit test, mimicking an actual send method, and making that send happen before req.ItemsCount() is called is gonna be nondeterministic, and I think that's what @splunkericl is also suggesting.

Any other way to mimic it would involve emptying the req explicitly before calling req.ItemsCount(), which would throw a null pointer error but I don't think we need that UT.

That's why I said, IMO we can merge it without UT, also internationally we have also patched this in our custom receiver.

Jul 16 '24 22:07 shivanshuraj1333

That looks good - can you please add the same test and fix for metrics and traces? Thanks!

Jul 23 '24 17:07 atoulme

Hi @dmitryax, mind taking a look?

Jul 24 '24 16:07 zpzhuSplunk

opentelemetry-collector opentelemetry-collector copied to clipboard

[exporterhelper] fix panic introduced by nil pointer in send

Description

Link to tracking issue

Testing

Documentation

Codecov Report

opentelemetry-collector
opentelemetry-collector copied to clipboard