fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

DNS resolution timeout/failure in 1.8.9

Open stevenarvar opened this issue 3 years ago • 26 comments

Bug Report

Describe the bug Hi, I am facing a DNS resolution timeout/failure using 1.8.9 with the forward module to a stackdriver.

To Reproduce

  • upgrade from 1.8.3 to 1.8.9
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [engine] chunk '1-1635532763.608049522.flb' cannot be retried: task_id=41, input=standard_log_emitted > output=stackdriver.1
[2021/10/29 18:43:41] [ warn] [input] emitter.8 paused (mem buf overlimit)
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01

Your Environment

  • Version used: 1.8.9
  • Configuration: stackdriver output and tail input
  • Environment name and version (e.g. Kubernetes? What version?): K8S 1.19
  • Filters and plugins: stackdriver & tail

Additional context Some fluent-bit pods eventually output logs such as Resource temporarily unavailable and gave up:

[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1 

stevenarvar avatar Oct 29 '21 19:10 stevenarvar

Issue could be related to https://github.com/fluent/fluent-bit/issues/4050

stevenarvar avatar Oct 29 '21 19:10 stevenarvar

Please also see: https://github.com/fluent/fluent-bit/issues/4257

matthewfala avatar Oct 29 '21 20:10 matthewfala

I have a similar issue by using fluent/fluent-bit:1.8.9-debug, fluent-bit can not resolve the headless service in AKS to forward the logs to flunetd statefulset. fluent/fluent-bit:1.8.4-debug image does not give those errors, unfortunately, I have to downgrade till there will be a fix for this bug, tested nslookup in fluent-bit and still can not resolve, I tested nslookup with an ubuntu image and it works, is there any chance

fluentbit logs:

[2021/11/15 15:14:49] [ warn] [engine] failed to flush chunk '1-1636989288.537725817.flb', retry in 10 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:50] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:50] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:50] [ warn] [engine] failed to flush chunk '1-1636989289.606633533.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.988513419.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.511459650.flb', retry in 7 seconds: task_id=4, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.704430196.flb', retry in 7 seconds: task_id=6, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.1] no upstream connections available

sdwerwed avatar Nov 15 '21 14:11 sdwerwed

I get the same error with the Elasticsearch output when configuring it with the Cloud_ID and Cloud_Auth config in both Minikube and AKS. I tried with multiple versions: 1.8.x series: I get the Domain Not Found error described above 1.7.9: I get an Unknown error.

The exact error (with v1.8.10) is:

[2021/11/22 20:03:17] [ warn] [net] getaddrinfo(host='***redacted***.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2021/11/22 20:03:17] [debug] [upstream] connection #-1 failed to **redacted***.azure.elastic-cloud.com:9243:443

So I have two observations:

  1. Maybe Fluent Bit is doing a DNS lookup using the host AND the port? The correct lookup would use only the host, not the port concatenated to it.
  2. It tries to connect to the host using two ports: The correct one and 443. I tried to set the port manually in addition to the Cloud_ID setting, but I get the same result.

bensta avatar Nov 22 '21 19:11 bensta

On 1.8.10 sporadically reproduced too [input:emitter:emitter_for_rewrite_tag.6] error registering chunk with tag:

urpyLLIKa avatar Nov 23 '21 06:11 urpyLLIKa

I tried the helm chart and manual installation but I have the same problem. is there any solution?

  • fluent-bit version 1.8.9
  • running on AKS.

[2022/02/07 20:21:18] [ warn] [net] getaddrinfo(host='*****.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found [2022/02/07 20:21:18] [ warn] [engine] failed to flush chunk '1-1644265237.698532458.flb', retry in 16 seconds: task_id=363, input=tail.0 > output=es.0 (out_id=0)

ehelvacikoylu avatar Feb 07 '22 20:02 ehelvacikoylu

Hello,

I had the same issue. Is there some resolution available ? This problem will be fixed?

Regards,

jcamu avatar Mar 02 '22 16:03 jcamu

Can you retest with the latest 1.8 (1.8.13 currently) or 1.9.0 release? There have been various fixes around DNS.

patrick-stephens avatar Mar 14 '22 21:03 patrick-stephens

@patrick-stephens Issue also occurs in 1.9.3.

[2022/05/13 09:13:27] [ warn] [net] getaddrinfo(host='xyz.westeurope.azure.elastic-cloud.com:9243', err=4):
  Domain name not found

@bensta I think you are right. When I issue a curl inside the fluent-debug container then a response is returned. If the issue would be related to the kube-dns or resolving then a resolving error should be returned by curl as well.

030 avatar May 13 '22 09:05 030

Related: https://stackoverflow.com/q/69405701/2777965

030 avatar May 13 '22 09:05 030

Our Cloud ID has been changed automatically recently from europe-west4.gcp.elastic-cloud.com$... to europe-west1.gcp.cloud.es.io:443$... (the decoded representation here) and all fluentbits stopped working (1.8.15) with

[ warn] [net] getaddrinfo(host='....europe-west4.gcp.elastic-cloud.com:443', err=4): Domain name not found

andrewnazarov avatar Jun 29 '22 20:06 andrewnazarov

@030 This issue still could be seen with the latest version. Ping for some attention. Try to confirm if the PR mentioned above could fix this issue or not and when do you plan to release it?

DrewZhang13 avatar Jul 11 '22 22:07 DrewZhang13

Not sure how much helpful is this, but I've faced the same problem for elastic cloud.As a workaround, Instead of using Cloud-ID & Cloud-Auth, append the cloud config in regular ES format :

        host domain-example.aws.elastic-cloud.com
        port 443
        HTTP_User elastic
        HTTP_Passwd es-password

tanmay-bhat avatar Jul 18 '22 17:07 tanmay-bhat

@DrewZhang13 Processed the PR feedback. We will have to wait until it is reviewed by one of the maintainers.

030 avatar Jul 19 '22 14:07 030

@leonardo-albertovich We are still seeing customers who report this issue with elasticsearch and say that neither net.dns.mode setting (TCP and UDP) helps.

Example config:

[OUTPUT]
        Name  es
        Match kube.*
        Host  my-host
        Port  9200
        Logstash_Format true
        Logstash_Prefix logstash
        Logstash_DateFormat %Y.%m.%d
        Time_Key @timestamp
        tls   Off
        net.dns.mode UDP # TCP doesn't work either

PettitWesley avatar Jul 19 '22 17:07 PettitWesley

Which fluent-bit version are they running?

leonardo-albertovich avatar Jul 19 '22 19:07 leonardo-albertovich

@leonardo-albertovich 1.8.9, same as reported in this issue.

PettitWesley avatar Jul 19 '22 21:07 PettitWesley

We are also seeing this same issue with fluent-bit v1.9.3. We are seeing repeated log messages like the following, and fluent-bit does not upload logs using the Stackdriver output plugin:

[2022/07/25 21:59:25] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers

I have not yet tried ndt.dns.mode=TCP, but am in the process of trying that. I will also note that inside of our fluent-bit containers the default resolver setting of ndots:5 is in place, and we are actively testing setting that to ndots:2. With ndots:5, we saw that fluent-bit was issuing 4 DNS queries for every external name resolution (e.g., "logging.googleapis.com"), which is not optimal at all. We have not yet got ndots:2 into our production platform, but I will report back here whether ndt.dns.mode=TCP in conjunction with ndots:2 will help us with this issue.

Side note: our platform is very geographically dispersed (all over the globe). Anecdotally, we are seeing these DNS issues with fluent-bit instances running on nodes that are very far away geographically from the VMs where CoreDNS is running in our clusters. We have been hypothesizing that the repeated failing DNS queries caused by ndots:5 on top of UDP packets traversing large distances, many hops and many networks, might be at least partially to blame for the DNS issues we are seeing with fluent-bit. Again, I will report back here once I have more details about how these mitigations help us, or not.

nkinkade avatar Jul 27 '22 21:07 nkinkade

@nkinkade I had to test stackdriver in GCE yesterday and in order to get it to work properly there you need to add this option dns.prefer_ipv4 on, that makes fluent-bit prioritize ipv4 results when querying the nameserver which works around the underlying issue with gce not allowing ipv6 connections to that address from those networks (I don't remember the exact cause because it's been a long time).

leonardo-albertovich avatar Aug 12 '22 19:08 leonardo-albertovich

@leonardo-albertovich How come all DNS settings are not documented? https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L43

PettitWesley avatar Aug 12 '22 19:08 PettitWesley

I think some of those settings catered some very specific corner cases and weren't meant for general usage.

leonardo-albertovich avatar Aug 12 '22 19:08 leonardo-albertovich

@leonardo-albertovich I will bring this up next time we have some sort of community or other meeting, IMO we should not have special hidden settings that only some maintainers understand and know about. If a setting needs a warning attached to it or some caveats, sure, that makes sense, but anything that exists should be documented IMO.

PettitWesley avatar Aug 12 '22 19:08 PettitWesley

@leonardo-albertovich: Thanks for the tip. Some small parts of our cluster run in GCP, but the overwhelming majority is comprised of globally distributed bare-metal machines. To be sure I understand the option, does it simply mean that if a DNS query returns both a v4 and v6 address for a name, that fluent-bit will always chose to use the v4 address over the v6 address? If so, I'm not sure how that would help us.

In my previous post I said I would report back on what the ndt.dns.mode=TCP and ndots:2 configuration did for us. They seem to have helped. Since we implemented them we haven't seen any other failures that appear on the surface to be related to failing DNS queries or timeouts. It's not clear exactly how or if ndt.dns.mode=TCP is working for us. It seems to have increased the coredns_dns_request_duration_seconds metric by around 100%, but that is okay because it was already only a couple milliseconds. Changing to ndots:2 made a huge difference, at least as far as the overall number of DNS requests and their results. The request rate to CoreDNS dropped by around 75%, and NXDOMAIN response codes went to almost nothing.

I still suspect there is some sort of bug in fluent-bit that causes a deadlock or something similar after certain network timeouts or failures.

nkinkade avatar Aug 12 '22 19:08 nkinkade

Yes @nkinkade that setting would cause fluent-bit to prefer ipv4 records any time both ipv4 and ipv6 records are available. It doesn't mean it it will stop using ipv6 if that's the only record type available, it's just about ordering and it's meant to address a very specific issue in GCE and it's not useful outside of that environment.

leonardo-albertovich avatar Aug 12 '22 19:08 leonardo-albertovich

@PettitWesley I think there is no intention to hide configuration options, actually the binary helper list them here:

~/coding/fluent-bit/build (master) » bin/fluent-bit -o stackdriver -h                                   130 ↵ edsiper@monox-2
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

HELP
stackdriver output plugin

DESCRIPTION
Send events to Google Stackdriver Logging

OPTIONS
google_service_credentials     Set the path for the google service credentials file
                               > default: default, type: string

...redacted...

custom_k8s_regex               Set a custom kubernetes regex filter
                               > default: (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$, type: string

resource_labels                Set the resource labels
                               > default: default, type: multiple comma delimited strings


NETWORKING
net.dns.mode                   Select the primary DNS connection type (TCP or UDP)
                               > default: default, type: string

net.dns.resolver               Select the primary DNS resolver type (LEGACY or ASYNC)
                               > default: default, type: string

net.dns.prefer_ipv4            Prioritize IPv4 DNS results when trying to establish a
                               connection
                               > default: false, type: boolean

net.keepalive                  Enable or disable Keepalive support
                               > default: true, type: boolean

net.keepalive_idle_timeout     Set maximum time allowed for an idle Keepalive connection
                               > default: 30s, type: time

net.connect_timeout            Set maximum time allowed to establish a connection, this
                               time includes the TLS handshake
                               > default: 10s, type: time

net.connect_timeout_log_error  On connection timeout, specify if it should log an error.
                               When disabled, the timeout is logged as a debug message
                               > default: true, type: boolean

net.source_address             Specify network address to bind for data traffic
                               > default: default, type: string

net.keepalive_max_recycle      Set maximum number of times a keepalive connection can be
                               used before it is retired.
                               > default: 2000, type: integer

we will make sure to update the web docs with such same info, but again, there is no such "special hidden settings", just undocumented in web.. it will be fixed soon

edsiper avatar Aug 12 '22 21:08 edsiper

@edsiper Could you check the PR?

030 avatar Sep 12 '22 08:09 030

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Dec 12 '22 02:12 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Dec 18 '22 01:12 github-actions[bot]