dd-trace-go icon indicating copy to clipboard operation
dd-trace-go copied to clipboard

fix(ddtracer/tracer): retry sendStats on EOF by marking POST idempotent

Open lattwood opened this issue 3 months ago • 7 comments

What does this PR do?

Addresses the edge-case of datadog-agent closing a persistent connection while dd-trace-go is trying to send it a request, resulting in error logs.

Encode the stats payload once, use a bytes.Reader for the body, and set req.GetBody + ContentLength. Also add an empty Idempotency-Key header (zero-length slice) so net/http.Transport treats the request as idempotent and can automatically retry on network errors when GetBody is available.

This fixes intermittent Post "...": EOF failures caused by a race where the agent (~5s idle timeout) closes an idle connection while the tracer (~90s) is sending- Write succeeds, then Read returns io.EOF.

Tests:

  • Add TestSendStatsNetworkErrorRetry with an eofConn that returns EOF after a successful Write to force the Transport retry path.
  • Assert two dials (retry on a new conn), /stats is retried, and Content-Length is set. The server sees three handler hits (/info & two /stats due to retry) while the client records two requests.

Result: sendStats transparently retries on transient EOFs; no behavior change on the success path.

Motivation

I was getting a ton of Post "...": EOF errors in my logs.

Reviewer's Checklist

  • [ ] Changed code has unit tests for its functionality at or near 100% coverage.
  • [ ] System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • [ ] There is a benchmark for any new code, or changes to existing code.
  • [ ] If this interacts with the agent in a new way, a system test has been added.
  • [ ] New code is free of linting errors. You can check this by running ./scripts/lint.sh locally.
  • [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
  • [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.

lattwood avatar Sep 24 '25 19:09 lattwood

@lattwood Thanks for the contribution. Allow us to ping the agent team and extra eyes from our guild. This looks good in a quick review, but we want to double check.

darccio avatar Sep 25 '25 07:09 darccio

In the meantime, can you check the test? Thanks!

--- FAIL: TestSendStatsNetworkErrorRetry (0.01s)
    transport_test.go:615: 
        	Error Trace:	/home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/transport_test.go:615
        	Error:      	Not equal: 
        	            	expected: 2
        	            	actual  : 3
        	Test:       	TestSendStatsNetworkErrorRetry

darccio avatar Sep 25 '25 07:09 darccio

will do!

i am going to have to start drinking if this is because of a macos/linux idiosyncrasy 🙃

On Thu, Sep 25, 2025 at 4:42 AM Dario Castañé @.***> wrote:

darccio left a comment (DataDog/dd-trace-go#3991) https://github.com/DataDog/dd-trace-go/pull/3991#issuecomment-3332619203

In the meantime, can you check the test? Thanks!

--- FAIL: TestSendStatsNetworkErrorRetry (0.01s) transport_test.go:615: Error Trace: /home/runner/work/dd-trace-go/dd-trace-go/ddtrace/tracer/transport_test.go:615 Error: Not equal: expected: 2 actual : 3 Test: TestSendStatsNetworkErrorRetry

— Reply to this email directly, view it on GitHub https://github.com/DataDog/dd-trace-go/pull/3991#issuecomment-3332619203, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPJNKD3T2CCDZI3IEFWLKD3UOMHHAVCNFSM6AAAAACHNFIZASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGMZSGYYTSMRQGM . You are receiving this because you were mentioned.Message ID: @.***>

lattwood avatar Sep 25 '25 12:09 lattwood

They would. I took a quick skim through the stats protobuf and it doesn't look like that would be an issue.

To fix this without a blind retry, we'd have to significantly increase the idle/read timeout from 3 seconds in the agent, which i'm guessing was done for a reason (the 90s here appears to be a go default).

hoping to get at the test failure tomorrow.

On Thu, Sep 25, 2025 at 12:11 PM Andrew Glaude @.***> wrote:

@.**** commented on this pull request.

In ddtrace/tracer/transport.go https://github.com/DataDog/dd-trace-go/pull/3991#discussion_r2379501000:

if err != nil {
  return err

} +

  • // by providing GetBody and a zero length slice for the Idempotency-Key

This seems like it could work well here, but wouldn't this mean all these requests are considered idempotent so even on a network error where the agent received and processed the request this would cause a retry of the same request? I took a look at the net/http docs for clarity here but they aren't clarifying this exact case for me

— Reply to this email directly, view it on GitHub https://github.com/DataDog/dd-trace-go/pull/3991#pullrequestreview-3268273491, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPJNKFMHH5GJE2MYTWE4T33UQA2DAVCNFSM6AAAAACHNFIZASVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZTENRYGI3TGNBZGE . You are receiving this because you were mentioned.Message ID: @.***>

lattwood avatar Sep 25 '25 20:09 lattwood

Hi @lattwood , Following up to see if you've made any progress with TestSendStatsNetworkErrorRetry? Thanks!

mtoffl01 avatar Oct 09 '25 19:10 mtoffl01

I haven't unfortunately, been fire fighting

On Thu, Oct 9, 2025 at 4:44 PM Mikayla Toffler @.***> wrote:

mtoffl01 left a comment (DataDog/dd-trace-go#3991) https://github.com/DataDog/dd-trace-go/pull/3991#issuecomment-3387268865

Hi @lattwood https://github.com/lattwood , Following up to see if you've made any progress with TestSendStatsNetworkErrorRetry? Thanks!

— Reply to this email directly, view it on GitHub https://github.com/DataDog/dd-trace-go/pull/3991#issuecomment-3387268865, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABPJNKFMPR4CXYMHODVGF7L3W23KNAVCNFSM6AAAAACHNFIZASVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGOBXGI3DQOBWGU . You are receiving this because you were mentioned.Message ID: @.***>

lattwood avatar Oct 09 '25 20:10 lattwood

@lattwood Were you able to give it another try? If not, we'll have to plan the required work if deemed.

darccio avatar Nov 21 '25 11:11 darccio