[BUG]: On application boot : failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries
Tracer Version(s)
3.2.1
Python Version(s)
Python 3.12
Pip Version(s)
uv
Bug Report
When the application starts on a new Kubernetes node.
I've got the following error : "failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries"
I believe the pattern is the following:
Kubernetes wants to start a new deployment.
- There are not enough nodes.
- It starts a new node.
- The deployment container starts faster than the Datadog pod from the daemonset on this node.
- It sends traces before the Datadog container is ready.
- This logs errors.
- Then the daemonset becomes ready.
- It no longer logs errors.
It it possible to make the RETRY_ATTEMPTS configurable ? It is hardcoded to 3 attemps.
Not sure exactly why but our services in Node.js does not have this issue.
Reproduction Code
No response
Error Logs
failed to send, dropping 1 traces to intake at http://10.13.132.67:8126/v0.5/traces after 3 retries
Libraries in Use
No response
Operating System
No response
Thanks for reporting this @JeremieDoctrine ! We don't have an option to support retries via an environment variable today so I'd consider this a feature request.
Instead, we're wondering if you can open a support ticket with us linking to this thread so we can study your Kubernetes deployment and understand why this seems to happen for Python but not Node.js: https://docs.datadoghq.com/help/ ?
Thank you @JeremieDoctrine for this feature request, I am investigating the same issue at this moment. To make RETRY_ATTEMPTS configurable was my first thought. But it will not work, unfortunately. Because retries are only working on HTTP session, not on the Network layer. Take a look:
# time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
....SKIPPED
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to http://localhost:8126 with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to http://localhost:8126 with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
....SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at http://localhost:8126. Exception raised: [Errno 111] Connection refused
....SKIPPED
Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
before_in_child hook _sample_before_fork was unregistered without first being registered
real 0m0.867s
user 0m0.625s
sys 0m0.121s
So, we have only 0.6 sec, and retries don't work with timeouts, don't work with Fibonachi Jitter from the code. Retries are useless if you have network issues. That's why people are suffering so many years with this: https://github.com/search?q=repo%3ADataDog%2Fdd-trace-py+failed+to+send%2C+dropping&type=issues
The proper code should wrap all network sessions to make retries helpful, something from ChatGPT (honestly, don't have time to contribute a proper fix, sorry)
# Replace or patch _flush_payloads
def _flush_payloads(self, events):
if not events:
return
payload = self._encoder.encode(events)
if not payload:
return
# Retry sending payload, including connection errors
for attempt, delay in enumerate(self._backoff()):
try:
self._send_payload(payload)
break # Success
except Exception as e:
self.log.warning("Retry %d failed to send traces: %s", attempt + 1, e)
time.sleep(delay)
else:
self.log.error("All retries failed for trace payload.")
def _backoff(self):
base = 0.5
for i in range(self.RETRY_ATTEMPTS):
# Fibonacci-ish backoff + jitter
yield base * (1.5 ** i) + random.uniform(0, 0.2)
Maybe @wantsui will invest some time to fix this very old network issue forever?
For Node.js, we have the same behavior. I run the simplest code with Express library like this:
DD_TRACE_DEBUG=true node --require dd-trace/init index.js
After I called curl localhost:3000 and got two output structures
Encoding payload: [{"trace_id":"4f375aa78dd90e09","span_id":"4f375aa78dd90e09","parent_id":"0000000000000000","name":"express.request","resource":"GET","error":0,"meta":{"_dd.p.tid":"6806d01200000000","_dd.p.dm":"-0",
.... SKIPPED
{"process_id":345,"_sampling_priority_v1":1},"start":1745276946011012000,"duration":104724365,"links":[],"service":"ddtrace-test"}]
>>> Error: [RC] Error in request
at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/remote_config/manager.js:152:13
at ClientRequest.<anonymous> (/ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/common/request.js:134:7)
at Object.onceWrapper (node:events:622:26)
at ClientRequest.emit (node:events:507:28)
at req.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:109:25)
at emitErrorEvent (node:_http_client:104:11)
at Socket.socketErrorListener (node:_http_client:518:5)
at Socket.emit (node:events:519:35)
at Socket.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
at emitErrorNT (node:internal/streams/destroy:170:8)
Error: connect ECONNREFUSED 127.0.0.1:8126
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1636:16)
at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
errno: -111,
code: 'ECONNREFUSED',
syscall: 'connect',
address: '127.0.0.1',
port: 8126
}
Request to the agent: {"path":"/v0.4/traces","method":"PUT","headers":{"Content-Type":"application/msgpack","Datadog-Meta-Tracer-Version":"5.48.1","X-Datadog-Trace-Count":"1","Datadog-Meta-Lang":"nodejs","Datadog-Meta-Lang-Version":"v23.11.0","Datadog-Meta-Lang-Interpreter":"v8"},"url":"http://127.0.0.1:8126/"}
>>> Error: Error sending payload to the agent (status code: undefined)
at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/agent/writer.js:45:13
at /ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/agent/writer.js:106:5
at ClientRequest.<anonymous> (/ddtrace-test/node_modules/dd-trace/packages/dd-trace/src/exporters/common/request.js:134:7)
at Object.onceWrapper (node:events:622:26)
at ClientRequest.emit (node:events:507:28)
at req.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/http/client.js:109:25)
at emitErrorEvent (node:_http_client:104:11)
at Socket.socketErrorListener (node:_http_client:518:5)
at Socket.emit (node:events:519:35)
at Socket.emit (/ddtrace-test/node_modules/dd-trace/packages/datadog-instrumentations/src/net.js:69:25)
Error: connect ECONNREFUSED 127.0.0.1:8126
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1636:16)
at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
errno: -111,
code: 'ECONNREFUSED',
syscall: 'connect',
address: '127.0.0.1',
port: 8126
}
It was dropped immediately after my call, and it seems like we have the same issue - if we can't create a TCP socket, we will fail with no retry.
It might be expected behavior, but in such a distributed infrastructure as K8S - network issues can happen during redeploy, node termination, rebalancing, draining, etc.
OK maybe for node.js I was tricked because our services takes longer to boot.
Thanks for your feedback! I'm raising this with the team.
One thing I'm wondering if you've explored is UDS instead of TCP to see if it improves the trace submission in Kubernetes?
@wantsui Hello, sir. The same behaviour:
- sock file doesn't exist:
time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
Configuration endpoint not set. Skipping fetching configuration.
...SKIPPED
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///var/run/datadog/apm.socket with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
...SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at unix:///var/run/datadog/apm.socket. Exception raised: [Errno 2] No such file or directory
9f42d5898f1a
Exiting product 'tracer'
Waiting 5 seconds for tracer to finish. Hit ctrl-c to quit.
....SKIPPED
real 0m0.711s
user 0m0.513s
sys 0m0.102s
- sock file exist but has no client behind:
# touch /tmp/foo.sock
time DD_TRACE_AGENT_TIMEOUT_SECONDS=10 DD_TRACE_DEBUG=true DD_TRACE_AGENT_URL=unix:///tmp/foo.sock ddtrace-run python app.py
debug mode has been enabled for the ddtrace logger
Configuration endpoint not set. Skipping fetching configuration.
....SKIPPED
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
creating new intake connection to unix:///tmp/foo.sock with timeout 10
Sending request: PUT v0.5/traces {'Datadog-Meta-Lang': 'python', 'Datadog-Meta-Lang-Version': '3.13.3', 'Datadog-Meta-Lang-Interpreter': 'CPython', 'Datadog-Meta-Tracer-Version': '3.5.1', 'Datadog-Client-Computed-Top-Level': 'yes', 'Content-Type': 'application/msgpack', 'X-Datadog-Trace-Count': '1'}
....SKIPPED
- DATADOG TRACER DIAGNOSTIC - Agent not reachable at unix:///tmp/foo.sock. Exception raised: [Errno 111] Connection refused
....SKIPPED
real 0m0.735s
user 0m0.517s
sys 0m0.082s
ddtrace is dropping immediately with no respect for timeout, retries, etc.
I've only ever tried to set up UDS on Kubernetes. For Kubernetes, there are two main parts to send traces via the unix domain socket instead of TCP:
- The Datadog Agent pod must be configured to listen for traces from UDS, seen at Configure the Datadog Agent to collect traces . Note: the socket needs to be available specifically to the Trace Agent pod. The instructions to enable it vary depending on whether you use helm or the Datadog Operator.
- The application pod with the python tracer needs to send traces to this same socket, following Configure your application pods to submit traces to Datadog Agent.
doctornkz-intelas, can you open a support ticket with details about the environment you're trying to set up UDS on so we can reproduce it but I'm also failing to find great documentation to show how to achieve with a local based Datadog Agent to follow (which is something we can improve on)? The error suggests the Datadog Agent isn't able to accept traces from UDS. If you start a ticket, our team can investigate further.
@wantsui Sorry if I misled you, sir. I don't need UDS in my K8S. I am using a pure cluster agent setup with <service_dns_host>:8126 via TCP. I only wanted to point you (the Datadog team) to the issue with retries. Retries are useless if your datadog-agent is restarting, not ready, or terminated. Many users are getting the error dropping 1 traces to intake at http://BLAH_HOST:8126/v0.5/traces after 3 retries because there is no pause mechanism, all 3 retries are passing by ~50 milliseconds. Just it.
Thanks, feedback noted @doctornkz-intelas ! I'll raise this with my team and see what they think about it!
Any idea the cause of this error, as I'm starting to see it in my logs.
failed to send, dropping 1 traces to intake at http://localhost:8126/v0.5/traces after 3 retries