network: Only the first nameserver in resolv.conf is ever used
Bug Report
Describe the bug
Given two nameserver records in /etc/resolv.conf, fluent-bit doesn't appear to ever use the second record. In particular, when the first record is unavailable (connection refused), fluent-bit simply gives up and errors.
To Reproduce
- Configure two DNS servers in resolv.conf (say 127.0.0.1 and a real value)
- Shut down the first one
[2022/04/11 19:00:36] [ warn] [net] getaddrinfo(host='example.com', err=12): Timeout while contacting DNS servers
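For reference, a resolv.conf matching the steps above might look like the fragment below. This is only an illustration: the second address is a placeholder from the documentation range, not the resolver actually used in this environment.

```
# /etc/resolv.conf
# First entry: local caching daemon (dnsmasq), stopped for the test
nameserver 127.0.0.1
# Second entry: placeholder for a real, reachable resolver
nameserver 192.0.2.53
```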
Expected behavior
I would expect the application to fail over and attempt resolution against the second nameserver entry.
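To make the expected behavior concrete, here is a minimal Python sketch of in-order failover across the nameserver entries. It is not Fluent Bit's resolver code (which is written in C). The names `parse_nameservers`, `resolve_with_failover`, and the `query` callback are hypothetical and exist only for illustration.

```python
from typing import Callable, List, Optional


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return nameserver addresses in the order they appear in resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop anything after '#'; lines starting with ';' never match below.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def resolve_with_failover(
    host: str,
    nameservers: List[str],
    query: Callable[[str, str], str],
) -> str:
    """Try each nameserver in order and return the first successful answer.

    `query(server, host)` is a caller-supplied (hypothetical) function that
    performs one lookup and raises OSError on refusal or timeout.
    """
    last_error: Optional[OSError] = None
    for server in nameservers:
        try:
            return query(server, host)
        except OSError as exc:
            # Remember the failure and move on to the next entry.
            last_error = exc
    raise OSError(f"all nameservers failed for {host}") from last_error
```

With this shape, a refused connection on the first entry only costs one failed attempt, and the lookup errors out only once every listed nameserver has failed.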
Screenshots
n/a
Your Environment
- Version used: 1.8.15 and 1.9.0
- Configuration:
[OUTPUT]
    Name        es
    Match       journal.*
    Host        elk.example.com
    Port        443
    Index       logs-journal
    Aws_Auth    On
    Aws_Region  us-east-1
    Tls         On
- Environment name and version (e.g. Kubernetes? What version?): EC2 and ECS Fargate
- Server type and version: n/a
- Operating System and version: CentOS 7
- Filters and plugins: n/a
This has been observed in both the td-agent-bit packages (1.9.0) and the aws/aws-for-fluent-bit images (1.8.15).
Additional context
We set 127.0.0.1 because our instances run local caching daemons (dnsmasq). Fluent-bit does not appear to gracefully fail over DNS resolution if the primary resolver is offline or not yet started.
We've observed that v1.8.1 does not exhibit this behavior. I'm guessing this is the result of changes in v1.8.5, but I have not bisected the releases to verify; I only skimmed the release notes.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This is still an active issue.
I'll take a look, thanks for letting us know it's still a problem.
@leonardo-albertovich Now I feel bad! :/ lol.
When I submitted the report, I tested with the latest fluent-bit I could get my hands on (1.9.0 for my EC2 instances, 1.8.15 for my ECS Fargate containers) and observed the issue. I haven't seen anything since then that suggests it's been addressed, which is what I should have replied to the bot with.
If you want me to vet, I can try a test tomorrow with 1.9.5 and see if I can trigger the behavior.
Definitely, it'd be great if you can double check it since that'd save me a bit of time trying to reproduce it in case it's fixed.
Thanks for staying on top of it regardless!
Okay, using the latest 1.9.6 with the following command I still see the issue.
fluent-bit -i cpu -o http -p host=www.example.com -v
With a valid entry in the first position of my resolv.conf I get:
...
[2022/07/13 13:39:44] [ info] [output:http:http.0] www.example.com:80, HTTP status=200
When I place an invalid entry (or stop my local dnsmasq) in the first position:
...
[2022/07/13 13:39:24] [ warn] [net] getaddrinfo(host='www.example.com', err=12): Timeout while contacting DNS servers
Awesome, I'll take a look at it since I wrote that code, I think I have some ideas as to what could be the issue but I need to validate them and try to come up with a workaround.
Thanks for your help, please ping me back in a week if I don't answer since I have a few things on my plate right now and it could slip through the cracks.
Hi @leonardo-albertovich, just a gentle ping on this issue as requested.
Thank you!
Thanks for staying on top of it @dekimsey, I still haven't been able to take a look at it, I know what the issue in the mechanism is and have a few ideas to make it better but no time to get to it yet. If you are interested in working on it feel free to message me on slack and I can get you up to speed on it. Otherwise, I'll take a look at it as soon as possible.
Hi @leonardo-albertovich thank you for the offer but C is way outside my area of familiarity. I don't think I'd be effective. I'll wait patiently :)
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This is still an on-going issue
You are right @dekimsey, that is a work in progress but sadly we weren't able to include it in 2.0. I've added the exempt-stale label to this issue so it doesn't go away until we release that improvement.
This issue can be mitigated by setting net.dns.resolver to LEGACY; see https://docs.fluentbit.io/manual/administration/networking
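As a sketch of how that mitigation could be applied to the output from the original report, the fragment below adds the option to the same [OUTPUT] section. It assumes the option is set per plugin, as described on the networking page linked above, and it has not been verified against this particular setup.

```
[OUTPUT]
    Name              es
    Match             journal.*
    Host              elk.example.com
    Port              443
    Index             logs-journal
    Aws_Auth          On
    Aws_Region        us-east-1
    Tls               On
    # Use the legacy (system) resolver instead of the default async one
    net.dns.resolver  LEGACY
```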