dd-trace-rb Resque workers start hanging when sending data to hostname

We encountered some issues when sending traces from Resque workers to a "centralized" location: multiple agents deployed as ECS tasks in AWS with service discovery enabled.

In our Datadog configuration we set the hostname to the DNS record of the tasks mentioned above. This worked fine for all the other services sending traces, but the Resque workers started failing after a while.

Our issue is almost identical to https://github.com/DataDog/dd-trace-rb/issues/466, but other than setting the hostname to "127.0.0.1" there was no other resolution there. Is this to be expected for Resque workers? That is, would configuring the transport layer as highlighted here not work and cause this issue?

When we don't use a custom hostname to send traces, everything works fine. We are on ddtrace version 0.54.2.

Thanks.

Nov 17 '22 13:11 bpastiu

👋 @bpastiu , thanks for reporting.

It seems to be a DNS failure. Could you provide more diagnostic information, such as which version of resque you use and how you configure and use the tracer? How did you find out the worker is hanging? We are also interested in your cloud environment.

Nov 21 '22 17:11 TonyCTHsu

Hi, thanks for responding.

While I can't paste any detailed code snippets, I can try to explain our use case. The version of resque we use is: 0.54.2. We configure the tracer by setting the hostname to a Route 53 record which points to multiple Fargate tasks running the Datadog agent. These Datadog agent tasks use ECS service discovery. We configure it using:

c.tracer :hostname => <DNS_RECORD>

Diagnosing this issue was no simple task. It took a few days to figure out, and it only came to light when some engineers posted gdb dumps showing that the Resque worker hangs when trying to communicate with the Datadog agent using DNS. Notes: this worked for all other services; only Resque showed issues. The issue does not reproduce immediately; it takes a few hours to surface.
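
For anyone trying to narrow this down, the hostname path can be probed separately from the tracer. The following is a minimal sketch, not part of ddtrace: `agent_reachable?` is a hypothetical helper, and port 8126 is assumed as the agent's trace port (the Datadog default). It resolves the hostname, attempts a TCP connection, and gives up after a bounded time instead of hanging.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper (not a ddtrace API): returns true if a TCP connection
# to host:port can be opened within roughly `timeout` seconds, false otherwise.
def agent_reachable?(host, port = 8126, timeout: 2)
  # Timeout.timeout is a best-effort outer bound that also covers the DNS
  # lookup; on some Ruby versions a blocking lookup may not be interruptible.
  Timeout.timeout(timeout + 1) do
    # connect_timeout bounds the TCP connect itself; the socket is closed
    # automatically when the block returns.
    Socket.tcp(host, port, connect_timeout: timeout) { true }
  end
rescue SocketError, SystemCallError, Timeout::Error
  # SocketError: name resolution failed.
  # SystemCallError: connection refused, timed out, unreachable, etc.
  false
end
```

Calling `agent_reachable?('<DNS_RECORD>')` periodically from inside a worker process and logging the result might show whether resolution or connection starts failing after a few hours.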

Nov 23 '22 10:11 bpastiu

This sounds similar to https://github.com/DataDog/dd-trace-rb/issues/3015 and to upstream issues https://github.com/sidekiq/sidekiq/issues/2175 and https://github.com/resque/resque/issues/1101.

@bpastiu you may want to try disabling FORK_PER_JOB as discussed on this thread.
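
For reference, Resque reads the `FORK_PER_JOB` environment variable at worker startup, and a value of `false` makes the worker run jobs in-process instead of forking a child per job. A minimal sketch of setting it from a Rakefile, assuming the workers are started via `rake resque:work` (the file layout here is an assumption, not something from this report):

```ruby
# Rakefile (sketch; assumes workers are started with `rake resque:work`)

# Resque checks ENV['FORK_PER_JOB'] when a worker starts; the string
# 'false' disables forking a child process for each job.
# `||=` keeps any value already set in the environment.
ENV['FORK_PER_JOB'] ||= 'false'

# require 'resque/tasks'  # load Resque's rake tasks after setting the variable
```

The trade-off is that without forking, any memory a job leaks stays in the long-lived worker process.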

Aug 08 '23 11:08 ivoanjo
