[Heartbeat] Retrying ES connection endlessly
- Heartbeat, when run in `run_once: true` mode, tries to establish the connection to ES multiple times, keeps retrying endlessly, and does not respond to SIGTERM. The only way to kill the instance is to issue a `SIGKILL`.
{"log.level":"info","@timestamp":"2022-06-24T23:18:41.806Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":141},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200) with 31 reconnect attempt(s)","service.name":"heartbeat","ecs.version":"1.6.0"}
- This is a P1 bug, as resources are consumed endlessly unless the underlying machine running Heartbeat is killed. I had to manually kill the process to release its resources.
Pinging @elastic/uptime (Team:Uptime)
This is going to be tricky to accomplish in code, but not impossible. The ES output assumes you want to retry a bunch, and doesn't have the ability to shut down the process (or communicate that back up the process hierarchy to the beat entrypoint).
The easiest way to handle this would be to let the ES output just call `os.Exit(1)` itself. It's kind of ugly from a flow-control standpoint though. I'd first see if we could signal a blocked output somewhere up the chain to `heartbeat.go`, but if that's too hard we may have to resort to the first option.
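As a rough illustration of the "signal up the chain" option, here is a minimal, self-contained Go sketch. None of these names (`fatalCh`, `publishWithRetry`) exist in libbeat; it only shows the shape of reporting a fatal output condition over a channel instead of calling `os.Exit(1)` from inside the output:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fatalCh is how a hypothetical output reports an unrecoverable condition
// instead of calling os.Exit(1) deep inside the output code.
var fatalCh = make(chan error, 1)

// publishWithRetry stands in for the ES output's retry loop. After maxAttempts
// consecutive failures it signals a fatal error rather than retrying forever.
func publishWithRetry(send func() error, maxAttempts int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := send(); err == nil {
			return
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // crude backoff
	}
	select {
	case fatalCh <- errors.New("elasticsearch output: max reconnect attempts exceeded"):
	default: // a fatal error was already signaled
	}
}

func main() {
	// Simulate a permanently unreachable ES instance.
	go publishWithRetry(func() error { return errors.New("connection refused") }, 3)

	// The beat's run loop (heartbeat.go in the real code) would select on this
	// alongside its normal shutdown signals and exit non-zero in run_once mode.
	err := <-fatalCh
	fmt.Println("shutting down:", err)
}
```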
I had a look at the code for the ES output publisher; it does backoff by default. I also checked whether any of the output settings would help, such as `backoff.max` or `max_retries`, but as far as I can see there is no way to close the connection through the client itself.
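For reference, these are the `output.elasticsearch` settings in question (the values below are illustrative only). They shape the per-batch backoff and retry behaviour, but as noted above they don't give the client a way to close the connection and let Heartbeat exit:

```yaml
output.elasticsearch:
  hosts: ["http://localhost:9200"]
  backoff.init: 1s   # delay before the first reconnect attempt
  backoff.max: 60s   # upper bound on the exponential backoff
  max_retries: 3     # times to retry publishing after a failure
```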
Do you think we can rewrite the Sync client and do something like this?
- Listen for the `onACK` messages in the client.
- If we don't receive them within a specified timeout (a network error in our case), kill the Heartbeat process, assuming run_once did not work.

I don't know if it's any better than passing a timeout context, WDYT? A rough sketch of the idea is below.
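A minimal sketch of that idea, assuming a hypothetical `onACK` callback (this is not the real beats pipeline API): a watchdog goroutine that exits the process when no ACK has arrived within the timeout, so a run-once invocation cannot hang forever.

```go
package main

import (
	"log"
	"os"
	"time"
)

// ackWatchdog returns a callback to invoke on every publish ACK. If no ACK is
// seen for `timeout`, it assumes the output is stuck and exits non-zero.
func ackWatchdog(timeout time.Duration) (onACK func(acked int)) {
	kick := make(chan struct{}, 1)
	go func() {
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		for {
			select {
			case <-kick:
				// Drain and reset the timer safely before the next wait.
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
				timer.Reset(timeout)
			case <-timer.C:
				log.Printf("no ACKs for %s, assuming the ES output is stuck; exiting", timeout)
				os.Exit(1)
			}
		}
	}()
	return func(acked int) {
		select {
		case kick <- struct{}{}: // reset the watchdog without blocking
		default:
		}
	}
}

func main() {
	onACK := ackWatchdog(30 * time.Second)
	// In the real client this would be wired into the pipeline's ACK handler.
	onACK(10) // simulate a batch being acknowledged
}
```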
We are seeing this happen more often in the Service; some of the jobs are getting killed after the 15-minute timeout because of endless connection retries.
Any idea why they can't connect? Bad ES instances? Decommissioned stacks?
This is kinda part of the design, but I'm glad to add a max retries option for run-once mode. That said, the ES output code is incredibly complex, so it may be a time-consuming adventure.
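To make the discussion concrete, here is a hedged sketch of what a capped, cancellation-aware reconnect loop could look like for run-once mode. The names are hypothetical and not taken from libbeat, but it shows both points raised in this issue: an upper bound on retries, and honouring SIGTERM while backing off.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// connectWithRetry tries connect up to maxRetries times with exponential
// backoff, and stops early if ctx is cancelled (e.g. by SIGTERM).
func connectWithRetry(ctx context.Context, connect func() error, maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // signal received: stop retrying immediately
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, err)
}

func main() {
	// Cancel the context on SIGTERM/SIGINT instead of ignoring the signal.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	err := connectWithRetry(ctx, func() error {
		return errors.New("connection refused") // simulate an unreachable ES
	}, 3, time.Second)
	fmt.Println("run-once finished:", err)
}
```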
There were no debug logs to indicate the source of the issue. My initial thought was an API key permission issue, but I can't say for sure.
Yeah, for run-once mode we need to set a max retries of 3 (ideally less than 5) to keep it intact and avoid holding resources for 15 minutes in case of a bad user issue.
Moving to TODO as there was no PR in progress.