dd-trace-rb
Resque workers start hanging when sending data to hostname
We ran into issues when sending traces from resque workers to a "centralized" location: multiple datadog agents deployed as ECS tasks in AWS with service discovery enabled.
In our datadog configuration we set the hostname to the DNS record of the agent tasks mentioned above. This worked fine for all the other services sending traces, but the resque workers started failing after a while.
Our issue is almost identical to https://github.com/DataDog/dd-trace-rb/issues/466, but other than setting the hostname to "127.0.0.1" there was no resolution there. Is this to be expected for resque workers? Would configuring the transport layer as highlighted here not work and cause this issue?
When we don't use a custom hostname to send traces everything works fine. ddtrace version 0.54.2.
Thanks.
👋 @bpastiu , thanks for reporting.
It seems to be a DNS failure. Could you provide more diagnostic information, such as which version of resque you are running and how you configure and use the tracer? How did you find out the worker is hanging? We are also interested in your cloud environment.
Hi, thanks for responding.
While I can't paste any detailed code snippets I can try and explain our use case.
The version of resque we use is: 0.54.2.
We configure the tracer by setting the hostname to a Route 53 record which points to multiple Fargate tasks running the datadog agent. These datadog agent tasks use ECS service discovery.
We configure it using:
c.tracer :hostname => <DNS_RECORD>
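For context, a line like that normally sits inside a `Datadog.configure` block. The following is only a sketch of what such an initializer might look like on ddtrace 0.x, not our actual code: the hostname is a placeholder, port 8126 is the agent's default trace port, and enabling the resque integration with `c.use :resque` is an assumption about the setup.

```ruby
require 'ddtrace'

Datadog.configure do |c|
  # Placeholder hostname: stands in for the Route 53 record in front of the
  # agent tasks. 8126 is the default trace agent port.
  c.tracer hostname: 'datadog-agent.example.internal', port: 8126

  # Assumption: the resque integration is enabled alongside the tracer.
  c.use :resque
end
```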
Diagnosing this issue was no simple task. It took a few days to figure out, and it only came to light when some engineers posted gdb dumps showing that the resque worker hangs when trying to communicate with the datadog agent using DNS.
Notes:
This worked for all other services. Only resque showed issues.
The issue does not reproduce immediately; it takes a few hours to come to light.
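For anyone wanting to gather similar diagnostics, a small check like the one below could be run from inside a worker host to see whether the agent's DNS record resolves at all and how long it takes. This is a rough sketch: `check_agent_dns` is a hypothetical helper, not part of ddtrace or resque, and it uses Ruby's pure-Ruby `Resolv` library rather than the libc `getaddrinfo` path that sockets normally go through, so a clean result here does not rule out the hang described above.

```ruby
require 'resolv'
require 'timeout'

# Hypothetical helper (not part of ddtrace or resque): resolve `hostname`
# and report how long it took, giving up after `seconds`.
# `resolver` is injectable so the helper can be exercised without a network.
def check_agent_dns(hostname, seconds: 2, resolver: Resolv)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  addresses = Timeout.timeout(seconds) { resolver.getaddresses(hostname) }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { hostname: hostname, addresses: addresses, elapsed: elapsed, timed_out: false }
rescue Timeout::Error
  # Resolution did not finish within the allowed time.
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { hostname: hostname, addresses: [], elapsed: elapsed, timed_out: true }
end
```

Calling `check_agent_dns('<DNS_RECORD>')` periodically and logging the returned hash would show whether the set of resolved addresses changes over time as the agent tasks are replaced, and whether lookups slow down before a worker stops responding.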
This sounds similar to https://github.com/DataDog/dd-trace-rb/issues/3015 and to upstream issues https://github.com/sidekiq/sidekiq/issues/2175 and https://github.com/resque/resque/issues/1101.
@bpastiu you may want to try disabling FORK_PER_JOB as discussed on this thread.
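As a rough sketch of what that could look like: resque, as far as I understand, reads the `FORK_PER_JOB` environment variable when a worker starts and treats the string `false` as "do not fork a child per job". Setting it in the process environment (or at the top of the Rakefile, as below) before the worker boots should be enough; the placement in a Rakefile is an assumption about your setup.

```ruby
# Rakefile or worker boot script (sketch).
# Assumption: resque decides whether to fork per job from the FORK_PER_JOB
# environment variable, with the string 'false' turning forking off.
ENV['FORK_PER_JOB'] = 'false'

# In a real Rakefile, resque's tasks would be loaded after this point:
# require 'resque/tasks'
```

Keep in mind that without a fork per job, every job runs in the long-lived worker process, so memory growth or leaked state from one job is no longer cleaned up when a child exits.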