fluent-bit

network: Only the first nameserver in resolv.conf is ever used

Open dekimsey opened this issue 4 years ago • 9 comments

Bug Report

Describe the bug

Given two nameserver records in /etc/resolv.conf, fluent-bit doesn't appear to ever use the second record. In particular, when the first record is unavailable (connection refused), fluent-bit simply gives up and errors.

To Reproduce

  • Configure two DNS servers in resolv.conf (say 127.0.0.1 and a real value)
  • Shut down the first one
[2022/04/11 19:00:36] [ warn] [net] getaddrinfo(host='example.com', err=12): Timeout while contacting DNS servers

Expected behavior

I would expect the application to fail over and attempt resolution against the second nameserver entry.
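As an illustration of the fail-over described here, a minimal Python sketch follows. It is not fluent-bit's resolver (which is written in C) and makes no claim about how a fix would be implemented. The `query` callable, `fake_query`, and the addresses 10.0.0.2 and 192.0.2.10 are hypothetical stand-ins so the example runs without touching the network.

```python
# Sketch only: ordered fail-over across the nameservers in resolv.conf.
# All names and addresses here are illustrative assumptions.

SAMPLE_RESOLV_CONF = """\
# local dnsmasq cache first, then an upstream resolver (made-up address)
nameserver 127.0.0.1
nameserver 10.0.0.2
"""


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in the file."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (starting with '#' or ';').
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def resolve_with_failover(host, nameservers, query):
    """Try each nameserver in order and return the first successful answer.

    `query(host, server)` is a hypothetical callable that returns a list of
    addresses or raises OSError (timeout, connection refused, and so on).
    """
    errors = []
    for server in nameservers:
        try:
            return query(host, server)
        except OSError as exc:
            # Remember the failure and move on to the next nameserver.
            errors.append((server, exc))
    raise OSError(f"all nameservers failed for {host!r}: {errors}")


def fake_query(host, server):
    """Simulate the report: the local resolver is down, the second one works."""
    if server == "127.0.0.1":
        raise ConnectionRefusedError("connection refused")
    return ["192.0.2.10"]  # documentation-range address, not a real answer


print(resolve_with_failover("example.com",
                            parse_nameservers(SAMPLE_RESOLV_CONF),
                            fake_query))
```

With the first nameserver refusing connections, the lookup falls through to the second entry instead of returning an error, which is the behavior the report asks for.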

Screenshots

n/a

Your Environment

  • Version used: 1.8.15 and 1.9.0
  • Configuration:
[OUTPUT]
  Name es
  Match journal.*
  Host elk.example.com
  Port 443
  Index logs-journal
  Aws_Auth On
  Aws_Region us-east-1
  Tls  On
  • Environment name and version (e.g. Kubernetes? What version?): EC2 and ECS Fargate
  • Server type and version: n/a
  • Operating System and version: CentOS 7
  • Filters and plugins: n/a

This has been observed in both the td-agent-bit packages (1.9.0) and the aws/aws-for-fluent-bit images (1.8.15).

Additional context

We set 127.0.0.1 because our instances have local caching daemons running (dnsmasq). Fluent-bit does not appear to fail over gracefully to another DNS server if the primary resolver is offline or not yet started.

We've observed that v1.8.1 does not exhibit this behavior. I'm guessing this is the result of changes in v1.8.5, but I have not bisected the releases to verify; I have only skimmed the release notes.

dekimsey, Apr 12 '22 16:04

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot], Jul 12 '22 02:07

This is still an active issue.

dekimsey, Jul 12 '22 20:07

I'll take a look, thanks for letting us know it's still a problem.

leonardo-albertovich, Jul 12 '22 20:07

@leonardo-albertovich Now I feel bad! :/ lol.

When I submitted the report, I tested with the latest fluent-bit I could get my hands on (1.9.0 for my EC2 instances, 1.8.15 for my ECS Fargate containers) and I would observe the issue. I haven't seen anything since then that suggests it's been addressed, which is what I should have replied to the bot with.

If you want me to vet, I can try a test tomorrow with 1.9.5 and see if I can trigger the behavior.

dekimsey, Jul 12 '22 20:07

Definitely, it'd be great if you can double check it since that'd save me a bit of time trying to reproduce it in case it's fixed.

Thanks for staying on top of it regardless!

leonardo-albertovich, Jul 12 '22 20:07

Okay, using the latest 1.9.6 with the following command I still see the issue.

fluent-bit -i cpu -o http -p host=www.example.com -v

With a valid entry in the first position of my resolv.conf I get:

...
[2022/07/13 13:39:44] [ info] [output:http:http.0] www.example.com:80, HTTP status=200

When I place an invalid entry (or stop my local dnsmasq) in the first position:

...
[2022/07/13 13:39:24] [ warn] [net] getaddrinfo(host='www.example.com', err=12): Timeout while contacting DNS servers

dekimsey, Jul 13 '22 13:07

Awesome, I'll take a look at it since I wrote that code. I think I have some ideas as to what could be the issue, but I need to validate them and try to come up with a workaround.

Thanks for your help, please ping me back in a week if I don't answer since I have a few things on my plate right now and it could slip through the cracks.

leonardo-albertovich, Jul 13 '22 14:07

Hi @leonardo-albertovich, just a gentle ping on this issue as requested.

Thank you!

dekimsey, Aug 01 '22 15:08

Thanks for staying on top of it @dekimsey. I still haven't been able to take a look at it. I know what the issue in the mechanism is and have a few ideas to make it better, but no time to get to it yet. If you are interested in working on it, feel free to message me on Slack and I can get you up to speed on it. Otherwise, I'll take a look at it as soon as possible.

leonardo-albertovich, Aug 02 '22 13:08

Hi @leonardo-albertovich thank you for the offer but C is way outside my area of familiarity. I don't think I'd be effective. I'll wait patiently :)

dekimsey, Aug 11 '22 17:08

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot], Nov 10 '22 02:11

This is still an ongoing issue.

dekimsey, Nov 10 '22 02:11

You are right @dekimsey, that is a work in progress but sadly we weren't able to include it in 2.0. I've added the exempt-stale label to this issue so it doesn't go away until we release that improvement.

leonardo-albertovich, Nov 10 '22 09:11

This issue can be mitigated by setting the following option (see https://docs.fluentbit.io/manual/administration/networking):

net.dns.resolver LEGACY

PettitWesley, Apr 17 '23 17:04
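A sketch of that mitigation applied to the [OUTPUT] section from the original report. It assumes net.dns.resolver is accepted as a per-output networking property, as the linked networking page describes. Only the last line is new; whether it resolves the fail-over problem in a given environment has not been verified here.

```text
[OUTPUT]
  Name es
  Match journal.*
  Host elk.example.com
  Port 443
  Index logs-journal
  Aws_Auth On
  Aws_Region us-east-1
  Tls On
  net.dns.resolver LEGACY
```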