beats
beats copied to clipboard
Add test for elasticsearch re-connection after network error & allow graceful shutdown
Proposed commit message
This commit reworks the eslegclient.Connection to accept a context in its Connect method, this allows the caller to cancel any in flight requests made by the connection by cancelling the context.
The libbeat outputs.Connectable interface (used by outputs.NetworkClient) had to be updated to accept the context, which required refactoring in most of the outputs to also accept a context on connect.
The worker from libbeat/publisher/pipeline/client_worker.go now uses a context for it's cancellation instead of a channel, this context is also used when creating a connection to Elasticsearch.
An integration test is added to ensure the ES output can always recover from network errors.
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- ~~[ ] I have made corresponding changes to the documentation~~
- ~~[ ] I have made corresponding change to the default configuration files~~
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added an entry in
CHANGELOG.next.asciidocorCHANGELOG-developer.next.asciidoc.
Disruptive User Impact
It's a bug fix, there is no disruptive user impact
~~## Author's Checklist~~
How to test this PR locally
- Build Filebeat
- Get it sending data to ES
- Disconnect from the network, stop ES, do anything that will prevent Filebeat from reaching ES
- Wait for network error logs
- Re-start ES/reconnect to the network
- Filebeat should recover and start sending data again.
Related issues
- https://github.com/elastic/beats/issues/40705
~~## Use cases~~ ~~## Screenshots~~ ~~## Logs~~
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @belimawr? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
backport-8./dis the label to automatically backport to the8./dbranch./dis the digit
backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I want to make sure I follow the lifetime of the connection properly. The Connect and Publish calls come from here: https://github.com/elastic/beats/blob/3c03c74b847bf35926a8b29aee6be76144982afe/libbeat/publisher/pipeline/client_worker.go#L131-L158
The close comes from here: https://github.com/elastic/beats/blob/3c03c74b847bf35926a8b29aee6be76144982afe/libbeat/outputs/backoff.go#L60-L64
Can you make Connect accept a context.Context so that the context is actually tied to the lifetime of the connection the way it is supposed to be? Then the client worker run() function would be responsible for creating+cancelling the context. That you can't see the close actually happen in that loop since it is dependent on the client type implementation is also annoying but one thing at a time.
This would require touching the interface of every output but it looks like the correct place for the lifetime of the context to be managed.
It looks like we only have one other use of eslegclient for monitoring that is not a test. https://github.com/elastic/beats/blob/3c03c74b847bf35926a8b29aee6be76144982afe/libbeat/monitoring/report/elasticsearch/elasticsearch.go#L215-L218
Can you make
Connectaccept acontext.Contextso that the context is actually tied to the lifetime of the connection the way it is supposed to be? Then the client workerrun()function would be responsible for creating+cancelling the context. That you can't see the close actually happen in that loop since it is dependent on the client type implementation is also annoying but one thing at a time.This would require touching the interface of every output but it looks like the correct place for the lifetime of the context to be managed.
👍 , and in addition if this is done it would require the rest of outputs to honor that new context, too. IIRC each uses different cancellation mechanisms atm.
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)
Pinging @elastic/sec-windows-platform (Team:Security-Windows Platform)
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)
@nfritts / @bturquet (welcome back) could we please have someone to review here?
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
I can only follow this pr to the point that all references of eslegclient are handled properly and checked that tests and build are successful. Approving this and will love to see the follow up for inflight requests cancelation (comment https://github.com/elastic/beats/pull/40794#discussion_r1761801658 )
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
This pull request is now in conflicts. Could you fix it? 🙏 To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/
git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue
Do we have proof that this actually solves these two bugs?
- https://github.com/elastic/beats/issues/40518
- https://github.com/elastic/beats/issues/38666
I see the original test to prove we can recover from a network error, but not anything proving that the Beat shut down quickly if one of its outgoing network requests is long running.
Do we have proof that this actually solves these two bugs?
- Windows service for Beat does not stop when output is unreachable #40518
- Windows service for Beat does not stop gracefully #38666
I see the original test to prove we can recover from a network error, but not anything proving that the Beat shut down quickly if one of its outgoing network requests is long running.
I have not had success reproducing those bugs manually, I tried with an older version of Filebeat that should still have the issue, however I didn't go through the whole Windows service manager, I was trying to get the Filebeat to get stuck during its shutdown process.
I can try it once more to reproduce and craft a test to ensure the bugs are fixed.
@marc-gr could you please review this PR?
I've unlinked this PR from https://github.com/elastic/beats/issues/40928. While the refactoring here should fix the issues mentioned on https://github.com/elastic/beats/issues/40928, we still need tests to ensure it. This PR is already rather large, so let's merge it as it is and work on writing the tests for https://github.com/elastic/beats/issues/40928 on a separated PR.
I also removed the backport from 8.15 and 8.16. This PR is rather large and the FF has passed, if needed we can backport it to a 8.16 patch release after more extensive testing.
This does fix https://github.com/elastic/beats/issues/40928 as it was written, but it doesn't fix all of the related issues.
The scope of https://github.com/elastic/beats/issues/40928 is only the regression test for the 8.15.1 bug where we would not reconnect after an error. You do have a test for that in this PR, so you can close it.
The bugs we can’t close are https://github.com/elastic/beats/issues/40518 and https://github.com/elastic/beats/issues/38666 which have no test proving they were fixed. I think the cause of these is that there was no context propagated from the SIGINT handler to the ES output to allow interrupting a long running connection. I think you wired this up, but we haven’t actually confirmed this fixed the problem so can’t close them.