dd-trace-py icon indicating copy to clipboard operation
dd-trace-py copied to clipboard

[BUG]: On application boot : failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries

Open JeremieDoctrine opened this issue 8 months ago • 9 comments

Tracer Version(s)

3.2.1

Python Version(s)

Python 3.12

Pip Version(s)

uv

Bug Report

When the application starts on a new Kubernetes node.

I've got the following error : "failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries"

I believe the pattern is the following:

Kubernetes wants to start a new deployment.

  1. There are not enough nodes.
  2. It starts a new node.
  3. The deployment container starts faster than the Datadog pod from the daemonset on this node.
  4. It sends traces before the Datadog container is ready.
  5. This logs errors.
  6. Then the daemonset becomes ready.
  7. It no longer logs errors.

It it possible to make the RETRY_ATTEMPTS configurable ? It is hardcoded to 3 attemps.

Not sure exactly why but our services in Node.js does not have this issue.

Reproduction Code

No response

Error Logs

failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries

Libraries in Use

No response

Operating System

No response

JeremieDoctrine avatar Apr 04 '25 11:04 JeremieDoctrine

Thanks for reporting this @JeremieDoctrine ! We don't have an option to support retries via an environment variable today so I'd consider this a feature request.

Instead, we're wondering if you can open a support ticket with us linking to this thread so we can study your Kubernetes deployment and understand why this seems to happen for Python but not Node.js: https://docs.datadoghq.com/help/ ?

wantsui avatar Apr 07 '25 18:04 wantsui

Thank you @JeremieDoctrine for this feature request, I am investigating the same issue at this moment. To make RETRY_ATTEMPTS configurable was my first thought. But it will not work, unfortunately. Because retries are only working on HTTP session, not on the Network layer. Take a look:

# time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
....SKIPPED
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to http://localhost:8126 with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to http://localhost:8126 with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
....SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at http://localhost:8126. Exception raised: [Errno 111] Connection refused
....SKIPPED
Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
before_in_child hook _sample_before_fork was unregistered without first being registered

real	0m0.867s
user	0m0.625s
sys	0m0.121s

So, we have only 0.6 sec, and retries don't work with timeouts, don't work with Fibonachi Jitter from the code. Retries are useless if you have network issues. That's why people are suffering so many years with this: https://github.com/search?q=repo%3ADataDog%2Fdd-trace-py+failed+to+send%2C+dropping&type=issues

The proper code should wrap all network sessions to make retries helpful, something from ChatGPT (honestly, don't have time to contribute a proper fix, sorry)

# Replace or patch _flush_payloads
def _flush_payloads(self, events):
    if not events:
        return

    payload = self._encoder.encode(events)
    if not payload:
        return

    # Retry sending payload, including connection errors
    for attempt, delay in enumerate(self._backoff()):
        try:
            self._send_payload(payload)
            break  # Success
        except Exception as e:
            self.log.warning("Retry %d failed to send traces: %s", attempt + 1, e)
            time.sleep(delay)
    else:
        self.log.error("All retries failed for trace payload.")

def _backoff(self):
    base = 0.5
    for i in range(self.RETRY_ATTEMPTS):
        # Fibonacci-ish backoff + jitter
        yield base * (1.5 ** i) + random.uniform(0, 0.2)

Maybe @wantsui will invest some time to fix this very old network issue forever?

doctornkz-intelas avatar Apr 21 '25 21:04 doctornkz-intelas

For Node.js, we have the same behavior. I run the simplest code with Express library like this:

DD_TRACE_DEBUG=true  node --require dd-trace/init index.js

After I called curl localhost:3000 and got two output structures

Encoding payload: [{"trace_id":"4f375aa78dd90e09","span_id":"4f375aa78dd90e09","parent_id":"0000000000000000","name":"express.request","resource":"GET","error":0,"meta":{"_dd.p.tid":"6806d01200000000","_dd.p.dm":"-0",
.... SKIPPED
{"process_id":345,"_sampling_priority_v1":1},"start":1745276946011012000,"duration":104724365,"links":[],"service":"ddtrace-test"}]
>>> Error: [RC] Error in request
    at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/remote_config/manager.js:152:13
    at ClientRequest.<anonymous> (/ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/common/request.js:134:7)
    at Object.onceWrapper (node:events:622:26)
    at ClientRequest.emit (node:events:507:28)
    at req.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:109:25)
    at emitErrorEvent (node:_http_client:104:11)
    at Socket.socketErrorListener (node:_http_client:518:5)
    at Socket.emit (node:events:519:35)
    at Socket.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
    at emitErrorNT (node:internal/streams/destroy:170:8)
Error: connect ECONNREFUSED 127.0.0.1:8126
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1636:16)
    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 8126
}
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"5.48.1","X-Datadog-Trace-Count":"1","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v23.11.0","Datadog-Meta-Lang-Interpreter":"v8"},"url":"http://127.0.0.1:8126/"}
>>> Error: Error sending payload to the agent (status code: undefined)
    at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/agent/writer.js:45:13
    at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/agent/writer.js:106:5
    at ClientRequest.<anonymous> (/ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/common/request.js:134:7)
    at Object.onceWrapper (node:events:622:26)
    at ClientRequest.emit (node:events:507:28)
    at req.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:109:25)
    at emitErrorEvent (node:_http_client:104:11)
    at Socket.socketErrorListener (node:_http_client:518:5)
    at Socket.emit (node:events:519:35)
    at Socket.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
Error: connect ECONNREFUSED 127.0.0.1:8126
    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1636:16)
    at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 8126
}

It was dropped immediately after my call, and it seems like we have the same issue - if we can't create a TCP socket, we will fail with no retry.

It might be expected behavior, but in such a distributed infrastructure as K8S - network issues can happen during redeploy, node termination, rebalancing, draining, etc.

doctornkz-intelas avatar Apr 21 '25 23:04 doctornkz-intelas

OK maybe for node.js I was tricked because our services takes longer to boot.

JeremieDoctrine avatar Apr 25 '25 09:04 JeremieDoctrine

Thanks for your feedback! I'm raising this with the team.

One thing I'm wondering if you've explored is UDS instead of TCP to see if it improves the trace submission in Kubernetes?

wantsui avatar Apr 25 '25 13:04 wantsui

@wantsui Hello, sir. The same behaviour:

  1. sock file doesn't exist:
time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket  ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
Configuration endpoint not set. Skipping fetching configuration.
...SKIPPED
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
...SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at unix:///var/run/datadog/apm.socket. Exception raised: [Errno 2] No such file or directory
9f42d5898f1a
Exiting product 'tracer'
Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
....SKIPPED
real	0m0.711s
user	0m0.513s
sys	0m0.102s
  1. sock file exist but has no client behind:
# touch /tmp/foo.sock
time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true DD_TRACE_AGENT_URL=unix:///tmp/foo.sock  ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
Configuration endpoint not set. Skipping fetching configuration.
....SKIPPED
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
....SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at unix:///tmp/foo.sock. Exception raised: [Errno 111] Connection refused
....SKIPPED
real	0m0.735s
user	0m0.517s
sys	0m0.082s

ddtrace is dropping immediately with no respect for timeout, retries, etc.

doctornkz-intelas avatar Apr 25 '25 16:04 doctornkz-intelas

I've only ever tried to set up UDS on Kubernetes. For Kubernetes, there are two main parts to send traces via the unix domain socket instead of TCP:

  1. The Datadog Agent pod must be configured to listen for traces from UDS, seen at Configure the Datadog Agent to collect traces . Note: the socket needs to be available specifically to the Trace Agent pod. The instructions to enable it vary depending on whether you use helm or the Datadog Operator.
  2. The application pod with the python tracer needs to send traces to this same socket, following Configure your application pods to submit traces to Datadog Agent.

doctornkz-intelas, can you open a support ticket with details about the environment you're trying to set up UDS on so we can reproduce it but I'm also failing to find great documentation to show how to achieve with a local based Datadog Agent to follow (which is something we can improve on)? The error suggests the Datadog Agent isn't able to accept traces from UDS. If you start a ticket, our team can investigate further.

wantsui avatar Apr 25 '25 17:04 wantsui

@wantsui Sorry if I misled you, sir. I don't need UDS in my K8S. I am using a pure cluster agent setup with <service_dns_host>:8126 via TCP. I only wanted to point you (the Datadog team) to the issue with retries. Retries are useless if your datadog-agent is restarting, not ready, or terminated. Many users are getting the error dropping 1 traces to intake at http://BLAH_HOST:8126/v0.5/traces after 3 retries because there is no pause mechanism, all 3 retries are passing by ~50 milliseconds. Just it.

doctornkz-intelas avatar Apr 25 '25 17:04 doctornkz-intelas

Thanks, feedback noted @doctornkz-intelas ! I'll raise this with my team and see what they think about it!

wantsui avatar Apr 25 '25 17:04 wantsui

Any idea the cause of this error, as I'm starting to see it in my logs.

failed to send, dropping 1 traces to intake at http://localhost:8126/v0.5/traces after 3 retries

Chidinma123456 avatar Sep 08 '25 13:09 Chidinma123456