dd-trace-rb icon indicating copy to clipboard operation
dd-trace-rb copied to clipboard

Failed to open TCP connection after upgrading to v0.54.2

Open agrobbin opened this issue 3 years ago • 15 comments

After upgrading to ddtrace v0.54.2, we started receiving these errors during app shutdown:

E, [2022-01-21T18:52:02.250127 #4] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/3.0.0/gems/ddtrace-0.54.2/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126) Location: /app/vendor/ruby-3.0.3/lib/ruby/3.0.0/net/http.rb:987:in `initialize'
E, [2022-01-21T18:52:04.189624 #4] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/3.0.0/gems/ddtrace-0.54.2/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126) Location: /app/vendor/ruby-3.0.3/lib/ruby/3.0.0/net/http.rb:987:in `initialize'

We upgraded ddtrace on 1/21, and starting with the first restart (which does a shutdown of our Heroku dynos before booting up new ones) after that, this error started showing up in our logs. It always seems to occur twice per shutdown. The version of our agent did not change from 7.32.4-1.

I wish I had more to go on here, but hopefully someone has a hunch about what might be going on here! Thanks so much.

agrobbin avatar Feb 05 '22 17:02 agrobbin

Hey @agrobbin, do you know what version of ddtrace you've upgraded from, so we can narrow our search domain?

marcotc avatar Feb 15 '22 16:02 marcotc

@marcotc yep! We went from 0.54.1 to 0.54.2, which also included the following gem updates:

  • debase-ruby_core_source 0.10.12 -> 0.10.14
  • msgpack 1.4.2 -> 1.4.3

agrobbin avatar Feb 15 '22 16:02 agrobbin

Thanks for the info @agrobbin.

The changes between 0.54.1 and 0.54.2 only really affect profiling (https://my.diffend.io/gems/ddtrace/0.54.1/0.54.2).

The log you shared indicates an issue reporting tracing data (lib/ddtrace/transport/http/client.rb:37) which... is a bit unexpected 🤔 .

Can you give us the following information so we can investigate further:

  1. Is profiling enabled for the application where you saw this issue?
  2. Could you share your configuration (Datadog.configure / DD_ environment variables)?
  3. Did you end up staying on 0.54.2, or did you revert back to 0.54.1? If you reverted back, did the message disappear?

ivoanjo avatar Feb 16 '22 09:02 ivoanjo

Happy to share some more details! We do not have profiling enabled (we trialed it at some point in the past, months ago, but it is not currently enabled). Here is our configuration:

Datadog.configure do |c|
  c.diagnostics.startup_logs.enabled = false
  c.runtime_metrics.enabled = true # Private Beta
  c.sampling.default_rate = 1.0

  if Rails.configuration.x.datadog_enabled
    c.use :aws, service_name: '[filtered]-aws'
    c.use :faraday, split_by_domain: true
    c.use :httpclient, split_by_domain: true
    c.use :http, split_by_domain: true
    c.use :rails, log_injection: true, service_name: '[filtered]', job_service: '[filtered]-worker'
    c.use :redis, service_name: '[filtered]-redis'
    c.use :rest_client, split_by_domain: true
    c.use :sidekiq, service_name: '[filtered]-worker', client_service_name: '[filtered]-worker'
  else
    c.tracer.enabled = false
  end
end

config.x.datadog_enabled is implemented as follows:

config.x.datadog_enabled = ENV['DISABLE_DATADOG_AGENT'].blank? && ENV['DD_API_KEY'].present?

And here are our DD_* env variables:

DD_AGENT_MAJOR_VERSION=7
DD_API_KEY=[filtered]
DD_APM_NON_LOCAL_TRAFFIC=true
DD_DYNO_HOST=true
DD_ENV=production
DD_LOG_LEVEL=error
DD_PROCESS_AGENT=true
DD_RUM_APPLICATION_ID=[filtered]
DD_RUM_CLIENT_TOKEN=[filtered]

The DD_RUM_* env variables are our own convention, and I don't believe are looked at by dd-trace, just in case there was any confusion there! We are running in Heroku, on the heroku-20 stack, with the latest DataDog buildpack (v7.33.0-1).

I have not reverted us back, but I can do that in our staging environment (all the same config, except DD_ENV of course) and see if the error disappears.

agrobbin avatar Feb 16 '22 13:02 agrobbin

I was able to try this in our staging environment, and it does seem to be happening on v0.54.1 as well. It's definitely weird that we didn't see this output previously, but it's definitely possible I simply mistyped something when doing my log searches previously!

E, [2022-02-16T17:55:16.539285 #4] ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/3.1.0/gems/ddtrace-0.54.1/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for 127.0.0.1:8126) Location: /app/vendor/ruby-3.1.0/lib/ruby/3.1.0/socket.rb:1214:in `__connect_nonblock'

agrobbin avatar Feb 16 '22 17:02 agrobbin

Cool, thanks for experimenting with 0.54.1 and confirming that it happens there as well.

Since you mentioned these logs show up on app shutdown, I wonder if the issue is related to some kind of race, where heroku signals both the datadog agent as well as the app to shut down, but the agent cleans up and finishes faster. Hmmm....

I'll need to loop in someone else to help with this investigation. Let me get back to you on next steps :)

ivoanjo avatar Feb 17 '22 10:02 ivoanjo

Hey @agrobbin sorry for the delay getting back to you.

Would you mind reaching out to support via https://www.datadoghq.com/support/ and including a link to this ticket?

This way support can investigate the ticket in the context of your account and specific agent setup.

ivoanjo avatar Feb 23 '22 15:02 ivoanjo

@ivoanjo done!

agrobbin avatar Feb 23 '22 18:02 agrobbin

hello @agrobbin, any news on how to fix that issue?

I receive the same error on my Heroku review apps, even when DISABLE_DATADOG_AGENT=true

akapronik avatar Apr 12 '22 14:04 akapronik

@akapronik hopefully @agrobbin can chime in with his learnings, but I do still recommend pinging support via https://www.datadoghq.com/support/ , as failures to contact the agent may be caused by a number of different causes.

ivoanjo avatar Apr 12 '22 14:04 ivoanjo

@akapronik @ivoanjo unfortunately I don't have any news here. I raised the issue with Datadog Support, and the last update I received was that Engineering is looking into the issue. Sorry to not be of more assistance!

agrobbin avatar Apr 13 '22 18:04 agrobbin

I am also receiving this error on Cloud Foundry (which is similar to Heroku) on 0.54.2.

gem 'ddtrace', require: 'ddtrace/auto_instrument'
gem 'dogstatsd-ruby'
    DD_TRACE_ENABLED: true
    DD_RUNTIME_METRICS_ENABLED: true

Note I don't use an initialiser. The error occurs during deployment when the application is starting.

WARN -- ddtrace: [ddtrace] DATADOG TRACER DIAGNOSTIC - Agent Error: Datadog::Transport::InternalErrorResponse ok?: unsupported?:, not_found?:, client_error?:, server_error?:, internal_error?:true, payload:, error_type:Errno::ECONNREFUSED error:Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126)
-- | --
ERROR -- ddtrace: [ddtrace] (/home/vcap/deps/1/vendor_bundle/ruby/2.7.0/gems/ddtrace-0.54.2/lib/ddtrace/transport/http/client.rb:37:in `rescue in send_request') Internal error during HTTP transport request. Cause: Failed to open TCP connection to 127.0.0.1:8126 (Connection refused - connect(2) for "127.0.0.1" port 8126) Location: /home/vcap/deps/1/ruby/lib/ruby/2.7.0/net/http.rb:960:in `initialize'

Are there any updates to this issue?

sterankin avatar Apr 29 '22 09:04 sterankin

@sterankin you mention "during deployment"; I wonder if your situation is similar to the Heroku issue where during push/buildpack build a few errors can show up in the log (see https://docs.datadoghq.com/agent/guide/heroku-ruby/#traces )...?

If you see the issue after deployment, when the service is running normally, then that's not expected at all, and I would also ask you to file a ticket with support so that we can look into specifics of your setup.

Unfortunately the "Failed to open TCP connection" error itself can be caused by many things:

  • The Datadog agent is not running at all
  • There is some misconfiguration preventing dd-trace-rb from reaching the Datadog agent
  • There are networking issues
  • Bug in dd-trace-rb, although the agent reporting code is something we don't touch often and hasn't been changed recently
  • Bug in the Datadog agent
  • etc...

This is why I advise filing a ticket so that we can contextually look into the root cause in each situation, which can be quite different 😄

ivoanjo avatar Apr 29 '22 13:04 ivoanjo

@sterankin you mention "during deployment"; I wonder if your situation is similar to the Heroku issue where during push/buildpack build a few errors can show up in the log (see https://docs.datadoghq.com/agent/guide/heroku-ruby/#traces )...?

If you see the issue after deployment, when the service is running normally, then that's not expected at all, and I would also ask you to file a ticket with support so that we can look into specifics of your setup.

Unfortunately the "Failed to open TCP connection" error itself can be caused by many things:

  • The Datadog agent is not running at all
  • There is some misconfiguration preventing dd-trace-rb from reaching the Datadog agent
  • There are networking issues
  • Bug in dd-trace-rb, although the agent reporting code is something we don't touch often and hasn't been changed recently
  • Bug in the Datadog agent
  • etc...

This is why I advise filing a ticket so that we can contextually look into the root cause in each situation, which can be quite different 😄

This could be correct, I will open a ticket but it could be a different issue than the op 👍

sterankin avatar Apr 29 '22 15:04 sterankin

FYI, I added these variables to Heroku production environments, and the issue has gone.

DD_TRACE_ENABLED=true
DD_APM_ENABLED=true

for staging, I added this, and it also fixes the issue

DD_TRACE_ENABLED=false

It works with the latest ddtrace version 0.54.2

akapronik avatar May 03 '22 10:05 akapronik

We are receiving a similar error with our ddtrace on our Rails app on Heroku, but as far as I know its limited to the console. When connect to our app's Rails console via Heroku CLI, the following message is generated about once every minute:

ERROR -- ddtrace: [ddtrace] (/app/vendor/bundle/ruby/2.7.0/gems/ddtrace-1.15.0/lib/datadog/profiling/http_transport.rb:60:in `export') Failed to report profiling data ({:agent=>"http://127.0.0.1:8126/"}): failed ddog_prof_Exporter_send: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111)

Unfortunately the DevOps that set this up is no longer with the company, so I'm doing my best to resolve this annoyance while learning the basics of Datadog. We are using v1.15 of the ddtrace gem.

I tried @akapronik 's suggestion of disabling the DD_TRACE_ENABLED env var but that does not appear to work.

As much as I can tell, the profiler is working - I see data when I view the Profiling page in Datadog.

johnpitchko avatar Feb 07 '24 22:02 johnpitchko

Hey @johnpitchko :wave:

I suspect what's happening is that when running the heroku CLI, the Datadog agent doesn't get started (which seems reasonable -- you may not want to get data from the console).

To disable the profiler just for that environment, you can use the DD_PROFILING_ENABLED environment variable; the variable you mentioned only controls distributed tracing -- although this is a bit of a historical naming confusion.

Let me know if this works for you!

ivoanjo avatar Feb 08 '24 09:02 ivoanjo

Hey @johnpitchko, hopefully my suggestion was helpful and no news is good news! Let us know if we can help any further :)

Since this ticket was originally about a now-old ddtrace version (0.54.2), I'm going to go ahead and close it. Please do open a new issue if you see this again / run into any other issue :)

ivoanjo avatar Feb 20 '24 13:02 ivoanjo

@ivoanjo I was successful in resolving this issue a few weeks ago - apologies for not responding sooner and thank you for your input!

johnpitchko avatar Feb 20 '24 18:02 johnpitchko

Nice, glad to know! :)

ivoanjo avatar Feb 20 '24 19:02 ivoanjo