fluent-bit
DNS resolution timeout/failure in 1.8.9
Bug Report
Describe the bug: Hi, I am facing DNS resolution timeouts/failures on 1.8.9 when sending logs to Stackdriver.
To Reproduce
- upgrade from 1.8.3 to 1.8.9
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:37] [error] [input:emitter:fluent_log_emitted] error registering chunk with tag: st01.fluent
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
[2021/10/29 18:43:38] [ warn] [engine] chunk '1-1635532763.608049522.flb' cannot be retried: task_id=41, input=standard_log_emitted > output=stackdriver.1
[2021/10/29 18:43:41] [ warn] [input] emitter.8 paused (mem buf overlimit)
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:41] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
Your Environment
- Version used: 1.8.9
- Configuration: stackdriver output and tail input
- Environment name and version (e.g. Kubernetes? What version?): K8S 1.19
- Filters and plugins: stackdriver & tail
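For reference, a minimal configuration matching the environment above might look like the following sketch; the paths, tag, and credentials file are illustrative placeholders, not taken from the actual deployment:

```
# Hypothetical minimal config for the reported setup (tail input +
# stackdriver output); all values below are placeholders.
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*

[OUTPUT]
    Name                        stackdriver
    Match                       kube.*
    google_service_credentials  /path/to/credentials.json
```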
Additional context: Some fluent-bit pods eventually output logs such as "Resource temporarily unavailable" and give up:
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [input:emitter:standard_log_emitted] error registering chunk with tag: st01.kubernetes.order-queue-processor-st01
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1
[2021/10/29 18:43:58] [error] [src/flb_http_client.c:1172 errno=11] Resource temporarily unavailable
[2021/10/29 18:43:58] [ warn] [output:stackdriver:stackdriver.1] http_do=-1
Issue could be related to https://github.com/fluent/fluent-bit/issues/4050
Please also see: https://github.com/fluent/fluent-bit/issues/4257
I have a similar issue with fluent/fluent-bit:1.8.9-debug: fluent-bit cannot resolve the headless service in AKS to forward logs to a fluentd StatefulSet. The fluent/fluent-bit:1.8.4-debug image does not produce those errors, so unfortunately I have to downgrade until there is a fix for this bug. I tested nslookup inside the fluent-bit container and it still cannot resolve the name, while nslookup from an ubuntu image works. Is there any chance of a fix?
fluentbit logs:
[2021/11/15 15:14:49] [ warn] [engine] failed to flush chunk '1-1636989288.537725817.flb', retry in 10 seconds: task_id=1, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:50] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:50] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:50] [ warn] [engine] failed to flush chunk '1-1636989289.606633533.flb', retry in 8 seconds: task_id=3, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.988513419.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.511459650.flb', retry in 7 seconds: task_id=4, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.0] no upstream connections available
[2021/11/15 15:14:51] [ warn] [engine] failed to flush chunk '1-1636989290.704430196.flb', retry in 7 seconds: task_id=6, input=tail.0 > output=forward.0 (out_id=0)
[2021/11/15 15:14:51] [ warn] [net] getaddrinfo(host='fluentd-1.fluentd-headless', err=4): Domain name not found
[2021/11/15 15:14:51] [error] [output:forward:forward.1] no upstream connections available
I get the same error with the Elasticsearch output when configuring it with Cloud_ID and Cloud_Auth, in both Minikube and AKS. I tried multiple versions:
- 1.8.x series: I get the Domain name not found error described above
- 1.7.9: I get an Unknown error
The exact error (with v1.8.10) is:
[2021/11/22 20:03:17] [ warn] [net] getaddrinfo(host='***redacted***.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2021/11/22 20:03:17] [debug] [upstream] connection #-1 failed to **redacted***.azure.elastic-cloud.com:9243:443
So I have two observations:
- Maybe Fluent Bit is doing a DNS lookup using the host AND the port? The correct lookup would use only the host, not the port concatenated to it.
- It tries to connect to the host using two ports: The correct one and 443. I tried to set the port manually in addition to the Cloud_ID setting, but I get the same result.
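The first observation above can be illustrated with a small sketch (Python; the helper name is mine, not fluent-bit code): passing the raw `host:port` string to a resolver makes it treat the whole string as a hostname, which is exactly what yields "Domain name not found". Splitting the port off first is what a correct lookup would do.

```python
def split_host_port(endpoint: str, default_port: int = 443):
    """Split a 'host:port' endpoint before any DNS lookup.

    A resolver given the raw 'host:port' string treats the whole thing
    as a hostname, which can never resolve.
    """
    host, sep, port = endpoint.rpartition(":")
    if sep and port.isdigit():
        return host, int(port)
    return endpoint, default_port
```
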
On 1.8.10 this is sporadically reproduced too: [input:emitter:emitter_for_rewrite_tag.6] error registering chunk with tag:
I tried the Helm chart and a manual installation, but I have the same problem. Is there any solution?
- fluent-bit version 1.8.9
- running on AKS.
[2022/02/07 20:21:18] [ warn] [net] getaddrinfo(host='*****.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found
[2022/02/07 20:21:18] [ warn] [engine] failed to flush chunk '1-1644265237.698532458.flb', retry in 16 seconds: task_id=363, input=tail.0 > output=es.0 (out_id=0)
Hello,
I had the same issue. Is there a resolution available? Will this problem be fixed?
Regards,
Can you retest with the latest 1.8 (1.8.13 currently) or 1.9.0 release? There have been various fixes around DNS.
@patrick-stephens Issue also occurs in 1.9.3.
[2022/05/13 09:13:27] [ warn] [net] getaddrinfo(host='xyz.westeurope.azure.elastic-cloud.com:9243', err=4): Domain name not found
@bensta I think you are right. When I issue a curl inside the fluent-debug container, a response is returned. If the issue were related to kube-dns or name resolution, curl should return a resolution error as well.
Related: https://stackoverflow.com/q/69405701/2777965
Our Cloud ID was recently changed automatically from europe-west4.gcp.elastic-cloud.com$...
to europe-west1.gcp.cloud.es.io:443$...
(the decoded representation here), and all Fluent Bit instances (1.8.15) stopped working with
[ warn] [net] getaddrinfo(host='....europe-west4.gcp.elastic-cloud.com:443', err=4): Domain name not found
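For context, Elastic documents the Cloud_ID as `deployment-name:base64(host$es-uuid$kibana-uuid)`, and as the decoded value above shows, the host segment may itself carry a `:port` suffix. A rough decoding sketch (helper names are mine, assuming that documented format):

```python
import base64

def decode_cloud_id(cloud_id: str):
    """Rough sketch of Elastic Cloud_ID decoding, assuming the format
    'name:base64(host$es_uuid$kibana_uuid)'. The host part may carry a
    ':port' suffix that must be split off before any DNS lookup."""
    _, _, encoded = cloud_id.partition(":")
    host, es_uuid, _kibana_uuid = base64.b64decode(encoded).decode().split("$")
    host, _, port = host.partition(":")
    return f"{es_uuid}.{host}", int(port) if port else 443
```

If the port is left concatenated to the hostname, getaddrinfo receives something like `....gcp.cloud.es.io:443` and fails with err=4 as in the log line above.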
@030 This issue can still be seen with the latest version. Pinging for some attention. Could you confirm whether the PR mentioned above fixes this issue, and when you plan to release it?
Not sure how helpful this is, but I've faced the same problem with Elastic Cloud. As a workaround, instead of using Cloud_ID & Cloud_Auth, specify the cloud endpoint in the regular ES output format:
host domain-example.aws.elastic-cloud.com
port 443
HTTP_User elastic
HTTP_Passwd es-password
@DrewZhang13 Processed the PR feedback. We will have to wait until it is reviewed by one of the maintainers.
@leonardo-albertovich We are still seeing customers who report this issue with Elasticsearch and say that neither net.dns.mode setting (TCP nor UDP) helps.
Example config:
[OUTPUT]
Name es
Match kube.*
Host my-host
Port 9200
Logstash_Format true
Logstash_Prefix logstash
Logstash_DateFormat %Y.%m.%d
Time_Key @timestamp
tls Off
# TCP doesn't work either
net.dns.mode UDP
Which fluent-bit version are they running?
@leonardo-albertovich 1.8.9, same as reported in this issue.
We are also seeing this same issue with fluent-bit v1.9.3. We are seeing repeated log messages like the following, and fluent-bit does not upload logs using the Stackdriver output plugin:
[2022/07/25 21:59:25] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=12): Timeout while contacting DNS servers
I have not yet tried net.dns.mode=TCP, but am in the process of trying that. I will also note that inside our fluent-bit containers the default resolver setting of ndots:5 is in place, and we are actively testing setting that to ndots:2. With ndots:5, we saw that fluent-bit was issuing 4 DNS queries for every external name resolution (e.g., "logging.googleapis.com"), which is not optimal at all. We have not yet got ndots:2 into our production platform, but I will report back here on whether net.dns.mode=TCP in conjunction with ndots:2 helps us with this issue.
Side note: our platform is very geographically dispersed (all over the globe). Anecdotally, we are seeing these DNS issues with fluent-bit instances running on nodes that are geographically very far from the VMs where CoreDNS runs in our clusters. We have been hypothesizing that the repeated failing DNS queries caused by ndots:5, on top of UDP packets traversing large distances, many hops, and many networks, might be at least partially to blame for the DNS issues we are seeing with fluent-bit. Again, I will report back here once I have more details about whether these mitigations help us.
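The 4-queries-per-name behavior follows from how ndots interacts with the resolver search list; here is a rough sketch of that expansion (assuming a typical three-domain Kubernetes search list; the helper name is mine):

```python
def candidate_names(name, search_domains, ndots=5):
    """Sketch of glibc-style search-list expansion: a name with fewer
    dots than ndots is first tried with each search domain appended,
    and only then as an absolute name."""
    if name.endswith("."):          # already fully qualified, no expansion
        return [name.rstrip(".")]
    expanded = [f"{name}.{d}" for d in search_domains]
    if name.count(".") >= ndots:
        return [name] + expanded
    return expanded + [name]
```

With three cluster search domains plus the final absolute query, an external name like logging.googleapis.com (2 dots, below ndots:5) costs four lookups, the first three of which return NXDOMAIN; dropping ndots to 2 makes the absolute query the first (and normally only) one, which matches the drop in request rate and NXDOMAIN responses reported below.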
@nkinkade I had to test stackdriver in GCE yesterday, and in order to get it to work properly there you need to add the option net.dns.prefer_ipv4 on. That makes fluent-bit prioritize IPv4 results when querying the nameserver, which works around an underlying issue with GCE not allowing IPv6 connections to that address from those networks (I don't remember the exact cause because it's been a long time).
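The ordering behavior described above can be sketched like this (my own helper names; fluent-bit's actual implementation is in C, this only illustrates the idea):

```python
import socket

def order_prefer_ipv4(results):
    """Stable-sort getaddrinfo() result tuples so AF_INET (IPv4)
    entries come first; IPv6 entries are kept, just tried later."""
    return sorted(results, key=lambda r: r[0] != socket.AF_INET)

def resolve_prefer_ipv4(host, port):
    """Resolve a host and return results with IPv4 preferred."""
    return order_prefer_ipv4(
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP))
```
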
@leonardo-albertovich How come all DNS settings are not documented? https://github.com/fluent/fluent-bit/blob/master/src/flb_upstream.c#L43
I think some of those settings catered some very specific corner cases and weren't meant for general usage.
@leonardo-albertovich I will bring this up next time we have some sort of community or other meeting. IMO we should not have special hidden settings that only some maintainers understand and know about. If a setting needs a warning or some caveats attached, sure, that makes sense, but anything that exists should be documented, IMO.
@leonardo-albertovich: Thanks for the tip. Some small parts of our cluster run in GCP, but the overwhelming majority consists of globally distributed bare-metal machines. To be sure I understand the option: does it simply mean that if a DNS query returns both a v4 and a v6 address for a name, fluent-bit will always choose the v4 address over the v6 address? If so, I'm not sure how that would help us.
In my previous post I said I would report back on what the net.dns.mode=TCP and ndots:2 configuration did for us. They seem to have helped. Since we implemented them we haven't seen any other failures that appear, on the surface, to be related to failing DNS queries or timeouts. It's not clear exactly how, or if, net.dns.mode=TCP is working for us. It seems to have increased the coredns_dns_request_duration_seconds metric by around 100%, but that is okay because it was already only a couple of milliseconds. Changing to ndots:2 made a huge difference, at least as far as the overall number of DNS requests and their results: the request rate to CoreDNS dropped by around 75%, and NXDOMAIN response codes went to almost nothing.
I still suspect there is some sort of bug in fluent-bit that causes a deadlock or something similar after certain network timeouts or failures.
Yes @nkinkade, that setting causes fluent-bit to prefer IPv4 records any time both IPv4 and IPv6 records are available. It doesn't mean it will stop using IPv6 if that's the only record type available; it's just about ordering. It's meant to address a very specific issue in GCE and is not useful outside of that environment.
@PettitWesley I think there is no intention to hide configuration options; actually, the binary helper lists them here:
~/coding/fluent-bit/build (master) » bin/fluent-bit -o stackdriver -h
Fluent Bit v2.0.0
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
HELP
stackdriver output plugin
DESCRIPTION
Send events to Google Stackdriver Logging
OPTIONS
google_service_credentials Set the path for the google service credentials file
> default: default, type: string
...redacted...
custom_k8s_regex Set a custom kubernetes regex filter
> default: (?<pod_name>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$, type: string
resource_labels Set the resource labels
> default: default, type: multiple comma delimited strings
NETWORKING
net.dns.mode Select the primary DNS connection type (TCP or UDP)
> default: default, type: string
net.dns.resolver Select the primary DNS resolver type (LEGACY or ASYNC)
> default: default, type: string
net.dns.prefer_ipv4 Prioritize IPv4 DNS results when trying to establish a
connection
> default: false, type: boolean
net.keepalive Enable or disable Keepalive support
> default: true, type: boolean
net.keepalive_idle_timeout Set maximum time allowed for an idle Keepalive connection
> default: 30s, type: time
net.connect_timeout Set maximum time allowed to establish a connection, this
time includes the TLS handshake
> default: 10s, type: time
net.connect_timeout_log_error On connection timeout, specify if it should log an error.
When disabled, the timeout is logged as a debug message
> default: true, type: boolean
net.source_address Specify network address to bind for data traffic
> default: default, type: string
net.keepalive_max_recycle Set maximum number of times a keepalive connection can be
used before it is retired.
> default: 2000, type: integer
we will make sure to update the web docs with the same info, but again, there are no "special hidden settings", just options undocumented on the web; it will be fixed soon.
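For reference, the net.* options from the help output above are set per plugin instance, in the [OUTPUT] section they should apply to; a hypothetical sketch (values illustrative only):

```
# Hypothetical example of setting the networking options per output.
[OUTPUT]
    Name                 stackdriver
    Match                *
    net.dns.mode         TCP
    net.dns.resolver     LEGACY
    net.dns.prefer_ipv4  on
    net.connect_timeout  10s
```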
@edsiper Could you check the PR?
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.