fix(ddtrace/tracer): retry sendStats on EOF by marking POST idempotent
### What does this PR do?
Addresses the edge case of the datadog-agent closing a persistent connection while dd-trace-go is sending a request on it, resulting in error logs.

Encode the stats payload once, use a `bytes.Reader` for the body, and set `req.GetBody` and `ContentLength`. Also add an empty `Idempotency-Key` header (zero-length slice) so `net/http.Transport` treats the request as idempotent and can automatically retry on network errors when `GetBody` is available.

This fixes intermittent `Post "...": EOF` failures caused by a race where the agent (~5s idle timeout) closes an idle connection while the tracer (~90s) is sending: `Write` succeeds, then `Read` returns `io.EOF`.
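A minimal sketch of the request construction described above (the helper name and shape are illustrative, not this PR's actual code):

```go
import (
	"bytes"
	"io"
	"net/http"
)

// newStatsRequest builds a POST whose body net/http.Transport can
// replay, making the request eligible for an automatic retry after a
// transient network error. Illustrative only.
func newStatsRequest(url string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	// GetBody re-creates the body for a retry. NewRequest already sets
	// this (and ContentLength) for *bytes.Reader; set explicitly here
	// to make the mechanism visible.
	req.GetBody = func() (io.ReadCloser, error) {
		return io.NopCloser(bytes.NewReader(payload)), nil
	}
	req.ContentLength = int64(len(payload))
	// A present-but-empty Idempotency-Key (zero-length value slice)
	// marks the POST as idempotent without writing any header value on
	// the wire.
	req.Header["Idempotency-Key"] = []string{}
	return req, nil
}
```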
Tests:
- Add `TestSendStatsNetworkErrorRetry` with an `eofConn` that returns EOF after a successful `Write`, to force the Transport retry path (see the sketch after this list).
- Assert two dials (retry on a new connection), that `/stats` is retried, and that `Content-Length` is set. The server sees three handler hits (`/info` and two `/stats` due to the retry) while the client records two requests.
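A rough sketch of the `eofConn` idea, assuming a thin wrapper over `net.Conn` (the PR's actual test helper may differ):

```go
import (
	"io"
	"net"
)

// eofConn simulates the agent closing an idle connection mid-request:
// the first Read after a successful Write fails with io.EOF, which is
// what the tracer observes when reading the response.
type eofConn struct {
	net.Conn
	wrote bool
}

func (c *eofConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	if err == nil {
		c.wrote = true
	}
	return n, err
}

func (c *eofConn) Read(p []byte) (int, error) {
	if c.wrote {
		return 0, io.EOF
	}
	return c.Conn.Read(p)
}
```

In a test this would presumably be injected via the Transport's `DialContext` and armed only for the first connection, so the retried request succeeds on a fresh dial (hence the two-dials assertion).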
Result: `sendStats` transparently retries on transient EOFs; no behavior change on the success path.
### Motivation
I was getting a ton of `Post "...": EOF` errors in my logs.
### Reviewer's Checklist
- [ ] Changed code has unit tests for its functionality at or near 100% coverage.
- [ ] System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
- [ ] There is a benchmark for any new code, or changes to existing code.
- [ ] If this interacts with the agent in a new way, a system test has been added.
- [ ] New code is free of linting errors. You can check this by running `./scripts/lint.sh` locally.
- [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.
@lattwood Thanks for the contribution. Allow us to ping the agent team and get extra eyes from our guild. This looks good on a quick review, but we want to double-check.
In the meantime, can you check the test? Thanks!
```
--- FAIL: TestSendStatsNetworkErrorRetry (0.01s)
    transport_test.go:615:
        Error Trace: /home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/transport_test.go:615
        Error:       Not equal:
                     expected: 2
                     actual  : 3
        Test:        TestSendStatsNetworkErrorRetry
```
will do!
i am going to have to start drinking if this is because of a macos/linux idiosyncrasy 🙃
Andrew Glaude commented on ddtrace/tracer/transport.go (https://github.com/DataDog/dd-trace-go/pull/3991#discussion_r2379501000):

```diff
 if err != nil {
 	return err
 }
+// by providing GetBody and a zero length slice for the Idempotency-Key
```

This seems like it could work well here, but wouldn't this mean all these requests are considered idempotent, so even on a network error where the agent received and processed the request, this would cause a retry of the same request? I took a look at the net/http docs for clarity here, but they aren't clarifying this exact case for me.

They would. I took a quick skim through the stats protobuf and it doesn't look like that would be an issue.

To fix this without a blind retry, we'd have to significantly increase the idle/read timeout from 3 seconds in the agent, which I'm guessing was done for a reason (the 90s here appears to be a Go default; it matches `http.DefaultTransport`'s `IdleConnTimeout`).

Hoping to get at the test failure tomorrow.
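For context on the question above: `net/http` only auto-retries requests it considers replayable. The check is roughly the following (a paraphrase of the unexported `(*Request).isReplayable` in the standard library's request.go; details vary by Go version):

```go
import "net/http"

// isReplayable paraphrases net/http's unexported replayability check.
// The Transport only retries a request on a fresh connection when this
// is true and the error looks like a transient network failure.
func isReplayable(r *http.Request) bool {
	// The body must be absent or re-creatable via GetBody.
	if r.Body == nil || r.GetBody != nil {
		switch r.Method {
		case "GET", "HEAD", "OPTIONS", "TRACE":
			return true
		}
		// A present Idempotency-Key header, even with zero values,
		// marks a POST (or other method) as safe to retry.
		if _, ok := r.Header["Idempotency-Key"]; ok {
			return true
		}
		if _, ok := r.Header["X-Idempotency-Key"]; ok {
			return true
		}
	}
	return false
}
```

So the retry can indeed fire in the window where the agent already processed the request, which is the scenario raised above; the reply's point is that re-sending a stats payload is tolerable.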
Hi @lattwood,
Following up to see if you've made any progress with `TestSendStatsNetworkErrorRetry`?
Thanks!
I haven't, unfortunately; I've been firefighting.
@lattwood Were you able to give it another try? If not, we'll have to plan the required work, if deemed necessary.