beats beats locker should be pid aware

Currently it is possible for a beat to incorrectly shut down and not remove the instance locker for the data path: https://github.com/elastic/beats/blob/1d78d15fb59bb17b8ad59783192904ebc407a2c5/libbeat/cmd/instance/locker.go#L50-L61 This can occur (irregularly) when a beat is running under the elastic-agent on a machine that goes to sleep/hibernation.

The data-path lock should be aware of the pid of the process that locks the directory. If the lock(or pid file) exists but the process identified by the pid does not exist, the lock should be considered invalid and removed/replaced

[x] Relates https://github.com/elastic/elastic-agent/issues/914

May 18 '22 17:05 michel-laterman

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

May 18 '22 17:05 elasticmachine

Alright, trying to spec out a few different ways to do this. This is a tad complex, as different operating systems have different file-locking implementations. The best solution here is probably to have a separate pidfile that we can check during the locking/unlocking logic.

Sep 21 '22 17:09 fearful-symmetry

It seems like the agent V2 branch is affected by this: https://github.com/elastic/elastic-agent/issues/997

At least the V2 branch can create a situation that triggers this problem, likely there is another bug with the V2 agent creating and killing the monitoring Beats rapidly.

Sep 29 '22 13:09 cmacknz

We have several reports that this problem is more common when the system running the agent goes to sleep or hibernates. We should explicitly try to test this situation with the fix to ensure it behaves reasonable.

Sep 29 '22 13:09 cmacknz

Hi @joshdover @jlind23 @cmacknz We have revalidated this PR and https://github.com/elastic/elastic-agent/issues/914 on 8.6 SNAPSHOT.

Steps followed:

Installed MAC and Windows fleet based agents with System integration.
Reassigned these agents to new Policy.[Multiple times]
Restarted VMs.
Shutdown VMs for approximately 30 mins and restarted again.

Observations:

We didn't observe any Metricbeat failure errors.
Filebeat and Metricbeat gets back to RUNNING state.
No below error line: metricbeat stderr: Exiting: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data).

Please let us know if we are missing any scenario here. Thank You!

Oct 21 '22 07:10 amolnater-qasource

Hi, unfortunately, after updating to 8.5.1 we are still observing the same error but now with a different message.

{"log.level":"error","@timestamp":"2022-11-18T11:36:01.500Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"filebeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 66\\n\"","agent.console.name":"filebeat","agent.console.type":"stderr","ecs.version":"1.6.0"}

While running 8.5.0 we observed the same error mentioned above and in https://github.com/elastic/elastic-agent/issues/997.

We have a DeamonSet of the agents running on multiple clusters. Here are the logs from one of them

elastic.log

Nov 18 '22 13:11 HaniAlshikh

Edit: misread the comment, this is a report against 8.5.1 which should have fixed this.

@fearful-symmetry any ideas on what is going on here?

Nov 18 '22 16:11 cmacknz

Sorry about the delay, looks like I forgot to check my github notifications on Friday. @HaniAlshikh so, if we get the data directory belongs to process with ... error, that means that the beat sees a running process that owns the existing [BEATNAME].pid file.

A few things:

How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.

Nov 21 '22 16:11 fearful-symmetry

I was able to also reproduce this running locally (MacOS) with a minikube setup. I deployed one of our ECK recipes with Fleet/Elastic Agent, noticed some missing 'ClusterRole's, made the edit and reapplied the config. The new Elastic Agent pods on 2 hosts then started logging errors, both for Metricbeat and Filebeat running in the pod, ex:

{"log.level":"error","@timestamp":"2022-11-21T20:38:52.073Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"metricbeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 951\\n\"","agent.console.name":"metricbeat","agent.console.type":"stderr","ecs.version":"1.6.0"}


root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}
metricbeat--8.5.1/metricbeat.lock
{"pid":951,"write_time":"2022-11-21T14:38:10.679257079Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# rm metricbeat--8.5.1/metricbeat.lock

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# uname -a
Linux minikube-m03 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I could not find any process running with the noted PIDs so it appears it is not following through and removing that 'old' lock file (?). I deleted both manually and after a couple of minutes, Elastic Agents began to report healthy again.

Also - in case not already noted - spelling of 'connot' (cannot)?

Nov 21 '22 22:11 ajoliveira

@ajoliveira can you share the recipe and steps you took? Trying to reproduce this, but I don't have much minikube experience.

This isn't a clean up issue, the issue is the beat detecting that an existing process with that PID, so it assumes another beat is already running and shuts down. I wonder if this is an issue involving k8s process namespacing somwhow, or perhaps how k8s in particular is restarting the process.

@ajoliveira 's issue is on darwin, so I'm curious what @HaniAlshikh is running on, since it could be specific to the process API used by beats.

If this turns out to be an issue with how elastic-agent or k8s is handling restarts, I wonder if we should just disable lockfiles for beats under elastic-agent, and make sure the agent is sufficiently smart when detecting when it's child processes are or aren't running.

Nov 21 '22 23:11 fearful-symmetry

Also @ajoliveira can I see the environment variables that agent is running with as well?

Nov 21 '22 23:11 fearful-symmetry

@fearful-symmetry Recipe I was using from elastic/cloud-on-k8s, specifically the 'fleet-kubernetes-integration' example.

In trying to reproduce a separate issue, I wanted to add POD_NAME as a variable to make it easier to identify Elastic Agent holding leader lease lock vs. unique ID that does not correlate to specific pod ... as well as adding more API resources to the ClusterRoles to allow collection of data for PVs, PVCs, etc. I'm assuming any other changes could do same if pods are replaced (?)

Not sure if you mean ENV vars that I'm setting, only one I added was for POD_NAME:

        containers:
        - name: agent
          securityContext:
            runAsUser: 0
          env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

But also including output of env | sort from one of the pods - elasti-agent-pod-env.txt

Additional data points just in case: minikube v1.27.0 on Darwin 12.6.1 w/ Kubernetes 1.25.0

Nov 21 '22 23:11 ajoliveira

Gonna try to reproduce this a bit more. It seems like there's two possibilities:

When a beat restarts, the old beat is still running and hasn't fully shut down
The process has shut down, but the OS hasn't fully cleaned up the resources, so it's reported as still running.
In the time between restarts, another process has started and taken the PID the beat previously had. On docker, PIDs are usually incremented as they're started, so I'm not sure how likely this is.

Nov 21 '22 23:11 fearful-symmetry

@fearful-symmetry regarding your questions:

How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS? Yes, 100% of the time but not for all agents at once from ~120 agents we had ~20-30 agents affected but the agents seemed to alternate between having the problem and not having it (so basically all agents but not at the same time) we are running the official image so: (docker.elastic.co/beats/elastic-agent-complete)

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"

What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux. this is old data as we had to downgrade to fix the problem but at the time we noticed, that once we got the log message then it is too late to investigate the process. We didn't really invest much time here but this how ps aux looked like

root@es-agent-elastic-agent-64fqx:/usr/share/elastic-agent# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2500   528 ?        Ss   Nov17   0:00 /usr/bin/tini -- /usr/local/bin/docker-entrypoint
root           7  101  1.3 2214468 432280 ?      Sl   Nov17 1343:10 elastic-agent container
root       14953  0.0  0.0   4248  3424 pts/0    Ss   09:18   0:00 bash
root       14961  0.0  0.0   5900  2752 pts/0    R+   09:18   0:00 ps -aux

Nov 22 '22 09:11 HaniAlshikh

Reopening while we evaluate the new reports

Nov 22 '22 10:11 joshdover

Looks like we are still seeing this in some of our own scale tests on VMs: https://github.com/elastic/elastic-agent/issues/1790#issuecomment-1329177230

Nov 28 '22 15:11 cmacknz

a bit of a wild thought, I haven't any data to support it, but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact of it?

Nov 28 '22 15:11 AndersonQ

a bit of a wild thought, I haven't any data to support it, but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact of it?

I think this is definitely possible. Talking with @fearful-symmetry we should actually investigate if we even need the Beats data path locks under agent, since they cannot be installed with overlapping data paths by agent. We can hopefully just bypass the file locking when a Beat detects it is run with the management.enabled flag set.

It would be much simpler to just delete the possibility of any future file lock problems.

Dec 06 '22 19:12 cmacknz

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

Dec 13 '22 19:12 cmacknz

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

May 15 '23 12:05 threatangler-jp

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

same question @cmacknz, its affecting about 400 of our devices on 8.5

May 16 '23 15:05 bm11100

Manually deleting the lock files before Filebeat starts should be the work around here. Wrapping agent or filebeat in a script that unconditionally deletes them is one way, they aren't actually necessary.

May 16 '23 18:05 cmacknz

@cmacknz is there a PowerShell command you could provide to do this workaround?

May 16 '23 18:05 threatangler-jp

beats beats copied to clipboard

beats locker should be pid aware

beats
beats copied to clipboard