[Heartbeat] Retrying ES connection endlessly
- Heartbeat, when run in `run_once: true` mode, tries to establish the connection to ES multiple times, keeps retrying endlessly, and does not respond to SIGTERM. The only way to kill the instance is to issue a `SIGKILL`.
{"log.level":"info","@timestamp":"2022-06-24T23:18:41.806Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":141},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200) with 31 reconnect attempt(s)","service.name":"heartbeat","ecs.version":"1.6.0"}
- This is a P1 bug, as resources are consumed endlessly unless the underlying machine running Heartbeat is killed. I had to manually kill the process to release its resources.
Pinging @elastic/uptime (Team:Uptime)
This is going to be tricky to accomplish in code, but not impossible. The ES output assumes you want to retry a bunch, and doesn't have the ability to shut down the process (or communicate that back up the process hierarchy to the beat entrypoint).
The easiest way to handle this would be to let the ES output just call `os.Exit(1)` itself. It's kind of ugly from a flow-control standpoint though. I'd first see if we could signal a blocked output somewhere up the chain to `heartbeat.go`, but if that's too hard we may have to resort to the first option.
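As a rough illustration of the "signal up the chain" option, here is a minimal, self-contained Go sketch. None of these names (`fatalCh`, `publishWithRetry`) exist in libbeat; it only shows the shape of reporting a fatal output condition over a channel instead of calling `os.Exit(1)` from inside the output:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fatalCh is how a hypothetical output reports an unrecoverable condition
// instead of calling os.Exit(1) deep inside the output code.
var fatalCh = make(chan error, 1)

// publishWithRetry stands in for the ES output's retry loop. After maxAttempts
// consecutive failures it signals a fatal error rather than retrying forever.
func publishWithRetry(send func() error, maxAttempts int) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := send(); err == nil {
			return
		}
		time.Sleep(time.Duration(attempt) * 100 * time.Millisecond) // crude backoff
	}
	select {
	case fatalCh <- errors.New("elasticsearch output: max reconnect attempts exceeded"):
	default: // a fatal error was already signaled
	}
}

func main() {
	// Simulate a permanently unreachable ES instance.
	go publishWithRetry(func() error { return errors.New("connection refused") }, 3)

	// The beat's run loop (heartbeat.go in the real code) would select on this
	// alongside its normal shutdown signals and exit non-zero in run_once mode.
	err := <-fatalCh
	fmt.Println("shutting down:", err)
}
```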
I had a look at the code for the ES output publisher; it does backoff by default. I also checked whether any of the output settings would help, such as `backoff.max` or `max_retries`, but as far as I can see there is no way to close the connection through the client itself.
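For reference, these are the `output.elasticsearch` settings in question (the values below are illustrative only). They shape the per-batch backoff and retry behaviour, but as noted above they don't give the client a way to close the connection and let Heartbeat exit:

```yaml
output.elasticsearch:
  hosts: ["http://localhost:9200"]
  backoff.init: 1s   # delay before the first reconnect attempt
  backoff.max: 60s   # upper bound on the exponential backoff
  max_retries: 3     # times to retry publishing after a failure
```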
Do you think we can rewrite the Sync client and do something like this?
- Listen for the `onACK` messages in the client.
- If we don't receive them within a specified timeout (a network error in our case), kill the Heartbeat process, assuming run_once did not work.

I don't know if it's any better than passing a timeout context, WDYT? A rough sketch of the idea is below.
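A minimal sketch of that idea, assuming a hypothetical `onACK` callback (this is not the real beats pipeline API): a watchdog goroutine that exits the process when no ACK has arrived within the timeout, so a run-once invocation cannot hang forever.

```go
package main

import (
	"log"
	"os"
	"time"
)

// ackWatchdog returns a callback to invoke on every publish ACK. If no ACK is
// seen for `timeout`, it assumes the output is stuck and exits non-zero.
func ackWatchdog(timeout time.Duration) (onACK func(acked int)) {
	kick := make(chan struct{}, 1)
	go func() {
		timer := time.NewTimer(timeout)
		defer timer.Stop()
		for {
			select {
			case <-kick:
				// Drain and reset the timer safely before the next wait.
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
				timer.Reset(timeout)
			case <-timer.C:
				log.Printf("no ACKs for %s, assuming the ES output is stuck; exiting", timeout)
				os.Exit(1)
			}
		}
	}()
	return func(acked int) {
		select {
		case kick <- struct{}{}: // reset the watchdog without blocking
		default:
		}
	}
}

func main() {
	onACK := ackWatchdog(30 * time.Second)
	// In the real client this would be wired into the pipeline's ACK handler.
	onACK(10) // simulate a batch being acknowledged
}
```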
We are seeing this happen more often in the Service; some of the jobs are getting killed after the 15-minute timeout because of endless connection retries.
Any idea why they can't connect? Bad ES instances? Decommissioned stacks?
This is kinda part of the design, but I'm glad to add a max retries option for run-once mode. That said, the ES output code is incredibly complex, so it may be a time-consuming adventure.
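To make the discussion concrete, here is a hedged sketch of what a capped, cancellation-aware reconnect loop could look like for run-once mode. The names are hypothetical and not taken from libbeat, but it shows both points raised in this issue: an upper bound on retries, and honouring SIGTERM while backing off.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

// connectWithRetry tries connect up to maxRetries times with exponential
// backoff, and stops early if ctx is cancelled (e.g. by SIGTERM).
func connectWithRetry(ctx context.Context, connect func() error, maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = connect(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // signal received: stop retrying immediately
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, err)
}

func main() {
	// Cancel the context on SIGTERM/SIGINT instead of ignoring the signal.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	err := connectWithRetry(ctx, func() error {
		return errors.New("connection refused") // simulate an unreachable ES
	}, 3, time.Second)
	fmt.Println("run-once finished:", err)
}
```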
There were no debug logs to indicate the source of the issue. My initial thought was an API key permission issue, but I can't say for sure.
Yeah, for run-once mode we need to set a max retries of 3 (ideally less than 5) to keep it intact and avoid holding resources for 15 minutes in case of a bad user issue.
Moving to TODO as there was no PR in progress.