opentelemetry-collector
opentelemetry-collector copied to clipboard
[exporterhelper] fix panic introduced by nil pointer in send
Description
There is no guarantee that after the exporterhelper sends the plog data downstream that the data won't be mutated in some way. (e.g by the batch_sender) This mutation could result in the proceeding call to req.ItemsCount() to panic due to nil pointer (by the time send returns, the resourcelogslice within the request might no longer be valid).
Link to tracking issue
Fixes # https://github.com/open-telemetry/opentelemetry-collector/issues/10033
Testing
Existing test cases should cover this code change.
Documentation
Hey @dmitryax and @atoulme. Just to give a bit more context. Our team was able to reproduce the problem only when we send large amount of data(500GB per day in an instance) through. So the theory in the description fits as this can only happen in very rare race condition. Can you take a look and see if this change is ok?
The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 92.24%. Comparing base (
49ea32b) to head (862d293). Report is 2 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #10402 +/- ##
=======================================
Coverage 92.24% 92.24%
=======================================
Files 403 403
Lines 18720 18723 +3
=======================================
+ Hits 17268 17271 +3
Misses 1097 1097
Partials 355 355
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
We are facing the same issue with a custom receiver.
...
logs, err := receiver.parser.Parse(body)
if err != nil {
writeError(w, err, http.StatusBadRequest)
return
}
// At this point, the receiver has accepted the payload
ctx := receiver.obsreport.StartLogsOp(req.Context())
err = receiver.nextConsumer.ConsumeLogs(ctx, logs)
receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)
if err != nil {
writeError(w, err, http.StatusInternalServerError)
return
}
...
Panic happens here:
receiver.obsreport.EndLogsOp(ctx, metadata.Type.String(), logs.LogRecordCount(), err)
Can we investigate this? Or maybe mention somewhere in the documentation this will happen?
@grandwizard28 that seems like a different issue, because it happens outside of exporterhelper. Please open a new issue, post version and pipeline. You can redact your custom receiver and just post the snippet of code you reference here.
The fix seems simple enough, but I'd be interested to try and see if a unit test can still catch this if we provoke this problem.
@atoulme I think we can merge this PR as it is, since, the race condition would happen when the request is completed and then the next line is called. Catching this in a unit test would be tricky...
err := lewo.nextSender.send(c, req)
lewo.obsrep.EndLogsOp(c, req.ItemsCount(), err)
WDYT?
If we do this:
- Having a unit test and making the clear expectation that it is not safe to call ItemsCount on any logs/metrics/traces after consumption needs to be spelled out
- we should do this across all signals, not just logs.
Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.
@atoulme yeah it is understandable to have a unit test to re-create the failures. Our team looked into it for a while but can't really reproduce the problem with unit test. Even on production it is only reproducible under specific scenario(specific pipeline setup and throughput).
We changed our internal pipeline so it doesn't happen again in production for now. We tried and wasn't able to re-create the failure in a unit test. If anyone else has any suggestions please feel free to post them.
Catching this in a unit test is not particularly tricky, if we're just looking to reproduce the failure: reset the logs to nil and create a situation where the panic is reproduced.
I'm talking about the changes in this PR. In the unit test, mimicking an actual send method, and making that send happen before req.ItemsCount() is called is gonna be nondeterministic, and I think that's what @splunkericl is also suggesting.
Any other way to mimic it would involve emptying the req explicitly before calling req.ItemsCount(), which would throw a null pointer error but I don't think we need that UT.
That's why I said, IMO we can merge it without UT, also internationally we have also patched this in our custom receiver.
That looks good - can you please add the same test and fix for metrics and traces? Thanks!
Hi @dmitryax, mind taking a look?