beats locker should be pid aware
Currently it is possible for a beat to incorrectly shut down and not remove the instance locker for the data path: https://github.com/elastic/beats/blob/1d78d15fb59bb17b8ad59783192904ebc407a2c5/libbeat/cmd/instance/locker.go#L50-L61 This can occur (irregularly) when a beat is running under the elastic-agent on a machine that goes to sleep/hibernation.
The data-path lock should be aware of the pid of the process that locks the directory. If the lock (or pid file) exists but the process identified by the pid does not exist, the lock should be considered invalid and removed/replaced.
- [x] Relates https://github.com/elastic/elastic-agent/issues/914
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Alright, trying to spec out a few different ways to do this. This is a tad complex, as different operating systems have different file-locking implementations. The best solution here is probably to have a separate pidfile that we can check during the locking/unlocking logic.
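For illustration, here is a minimal sketch of the kind of PID-aware locking being proposed, using only the Go standard library. It records the owning PID inside the lock file itself (the real implementation may use a separate pidfile instead), and the names `lockInfo`, `processExists`, and `acquireLock` are hypothetical rather than libbeat APIs; the liveness probe shown is Unix-specific.

```go
// Hypothetical sketch of a PID-aware data-path lock; not libbeat's real
// implementation. Unix-only: the liveness probe uses signal 0.
package locker

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

type lockInfo struct {
	PID       int       `json:"pid"`
	WriteTime time.Time `json:"write_time"`
}

// processExists reports whether a process with the given PID appears to be
// running. Sending signal 0 checks existence/permissions without delivering
// a signal; EPERM means the process exists but belongs to another user.
func processExists(pid int) bool {
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	err = proc.Signal(syscall.Signal(0))
	return err == nil || errors.Is(err, syscall.EPERM)
}

// acquireLock takes the data-path lock, replacing a stale lock whose owning
// process no longer exists.
func acquireLock(path string) error {
	if raw, err := os.ReadFile(path); err == nil {
		var old lockInfo
		if json.Unmarshal(raw, &old) == nil && processExists(old.PID) {
			return fmt.Errorf("data path already locked by pid %d", old.PID)
		}
		// The owner is gone: treat the lock as stale and remove it.
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	raw, err := json.Marshal(lockInfo{PID: os.Getpid(), WriteTime: time.Now().UTC()})
	if err != nil {
		return err
	}
	// O_EXCL avoids silently clobbering a lock written by a racing process.
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(raw)
	return err
}
```

The JSON shape here intentionally mirrors the `{"pid": ..., "write_time": ...}` contents of the `.lock` files shown later in this thread.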
It seems like the agent V2 branch is affected by this: https://github.com/elastic/elastic-agent/issues/997
At least the V2 branch can create a situation that triggers this problem, likely there is another bug with the V2 agent creating and killing the monitoring Beats rapidly.
We have several reports that this problem is more common when the system running the agent goes to sleep or hibernates. We should explicitly try to test this situation with the fix to ensure it behaves reasonably.
Hi @joshdover @jlind23 @cmacknz We have revalidated this PR and https://github.com/elastic/elastic-agent/issues/914 on 8.6 SNAPSHOT.
Steps followed:
- Installed macOS and Windows Fleet-based agents with the System integration.
- Reassigned these agents to a new policy (multiple times).
- Restarted VMs.
- Shut down VMs for approximately 30 minutes and restarted them again.
Observations:
- We didn't observe any Metricbeat failure errors.
- Filebeat and Metricbeat get back to the RUNNING state.
- The error line below was not observed:
metricbeat stderr: Exiting: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data).
Please let us know if we are missing any scenario here. Thank You!
Hi, unfortunately, after updating to 8.5.1 we are still observing the same error, but now with a different message:
{"log.level":"error","@timestamp":"2022-11-18T11:36:01.500Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"filebeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 66\\n\"","agent.console.name":"filebeat","agent.console.type":"stderr","ecs.version":"1.6.0"}
While running 8.5.0, we observed the same error mentioned above and in https://github.com/elastic/elastic-agent/issues/997.
We have a DaemonSet of the agents running on multiple clusters. Here are the logs from one of them.
Edit: misread the comment, this is a report against 8.5.1 which should have fixed this.
@fearful-symmetry any ideas on what is going on here?
Sorry about the delay, looks like I forgot to check my github notifications on Friday.
@HaniAlshikh so, if we get the `data directory belongs to process with ...` error, that means that the beat sees a running process that owns the existing `[BEATNAME].pid` file.
A few things:
- How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
- What process on the system matches that PID, and what state is that process in? For example, check with `ps aux | grep 66` on Linux.
I was able to also reproduce this running locally (MacOS) with a minikube setup. I deployed one of our ECK recipes with Fleet/Elastic Agent, noticed some missing 'ClusterRole's, made the edit and reapplied the config. The new Elastic Agent pods on 2 hosts then started logging errors, both for Metricbeat and Filebeat running in the pod, ex:
{"log.level":"error","@timestamp":"2022-11-21T20:38:52.073Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"metricbeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 951\\n\"","agent.console.name":"metricbeat","agent.console.type":"stderr","ecs.version":"1.6.0"}
root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}
metricbeat--8.5.1/metricbeat.lock
{"pid":951,"write_time":"2022-11-21T14:38:10.679257079Z"}
root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# rm metricbeat--8.5.1/metricbeat.lock
root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}
root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# uname -a
Linux minikube-m03 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
I could not find any process running with the noted PIDs, so it appears it is not following through and removing that 'old' lock file(?). I deleted both manually, and after a couple of minutes the Elastic Agents began to report healthy again.
Also - in case not already noted - spelling of 'connot' (cannot)?
@ajoliveira can you share the recipe and steps you took? Trying to reproduce this, but I don't have much minikube experience.
This isn't a cleanup issue; the issue is the beat detecting an existing process with that PID, so it assumes another beat is already running and shuts down. I wonder if this is an issue involving k8s process namespacing somehow, or perhaps how k8s in particular is restarting the process.
@ajoliveira's issue is on darwin, so I'm curious what @HaniAlshikh is running on, since it could be specific to the process API used by beats.
If this turns out to be an issue with how elastic-agent or k8s is handling restarts, I wonder if we should just disable lockfiles for beats under elastic-agent, and make sure the agent is sufficiently smart when detecting whether its child processes are or aren't running.
Also @ajoliveira can I see the environment variables that agent is running with as well?
@fearful-symmetry The recipe I was using is from elastic/cloud-on-k8s, specifically the 'fleet-kubernetes-integration' example.
In trying to reproduce a separate issue, I wanted to add `POD_NAME` as a variable to make it easier to identify the Elastic Agent holding the leader lease lock vs. a unique ID that does not correlate to a specific pod ... as well as adding more API resources to the ClusterRoles to allow collection of data for PVs, PVCs, etc. I'm assuming any other changes could do the same if pods are replaced (?)
Not sure if you mean ENV vars that I'm setting; the only one I added was for `POD_NAME`:
containers:
- name: agent
  securityContext:
    runAsUser: 0
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
But also including output of `env | sort` from one of the pods - elasti-agent-pod-env.txt
Additional data points just in case: minikube v1.27.0 on Darwin 12.6.1 w/ Kubernetes 1.25.0
Gonna try to reproduce this a bit more. It seems like there are a few possibilities (see the sketch after this list):
- When a beat restarts, the old beat is still running and hasn't fully shut down
- The process has shut down, but the OS hasn't fully cleaned up the resources, so it's reported as still running.
- In the time between restarts, another process has started and taken the PID the beat previously had. On docker, PIDs are usually incremented as they're started, so I'm not sure how likely this is.
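To make the second possibility concrete, here is a small standalone demo (not beats code) showing that on Linux/macOS a child that has exited but has not yet been reaped still passes a signal-0 liveness probe until the parent calls wait; the third possibility (PID reuse) would fool the same probe even after reaping.

```go
// Demonstrates why a PID can look "alive" after the process has exited:
// until the parent reaps it, the child stays a zombie and kill(pid, 0)
// still succeeds. Linux/macOS only.
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "0.1")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	pid := cmd.Process.Pid

	time.Sleep(500 * time.Millisecond) // the child has exited by now, but is unreaped

	// Signal 0 probes process existence without sending anything.
	fmt.Printf("before reaping: kill(%d, 0) => %v\n", pid, syscall.Kill(pid, 0)) // nil: zombie still "exists"

	_ = cmd.Wait() // reap the child

	fmt.Printf("after reaping:  kill(%d, 0) => %v\n", pid, syscall.Kill(pid, 0)) // ESRCH: no such process
}
```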
@fearful-symmetry regarding your questions:
- How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
Yes, 100% of the time, but not for all agents at once: of the ~120 agents we had, ~20-30 were affected, but the agents seemed to alternate between having the problem and not having it (so basically all agents, just not at the same time).
We are running the official image (docker.elastic.co/beats/elastic-agent-complete):
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
- What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.
This is old data, as we had to downgrade to fix the problem, but at the time we noticed that once we got the log message it was already too late to investigate the process. We didn't really invest much time here, but this is how `ps aux` looked:
root@es-agent-elastic-agent-64fqx:/usr/share/elastic-agent# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2500 528 ? Ss Nov17 0:00 /usr/bin/tini -- /usr/local/bin/docker-entrypoint
root 7 101 1.3 2214468 432280 ? Sl Nov17 1343:10 elastic-agent container
root 14953 0.0 0.0 4248 3424 pts/0 Ss 09:18 0:00 bash
root 14961 0.0 0.0 5900 2752 pts/0 R+ 09:18 0:00 ps -aux
Reopening while we evaluate the new reports
Looks like we are still seeing this in some of our own scale tests on VMs: https://github.com/elastic/elastic-agent/issues/1790#issuecomment-1329177230
a bit of a wild thought, I haven't any data to support it, but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact of it?
a bit of a wild thought, I haven't any data to support it, but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact of it?
I think this is definitely possible. Talking with @fearful-symmetry we should actually investigate if we even need the Beats data path locks under agent, since they cannot be installed with overlapping data paths by agent. We can hopefully just bypass the file locking when a Beat detects it is run with the management.enabled flag set.
It would be much simpler to just delete the possibility of any future file lock problems.
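A rough sketch of what such a bypass could look like (hypothetical names and file paths, not the actual libbeat change):

```go
// Hypothetical sketch of skipping the data-path lock when the Beat is managed
// by Elastic Agent; the real change in libbeat may differ.
package locker

import (
	"os"
	"path/filepath"
)

// LockDataPath takes the data-path lock, unless the Beat runs under agent
// management, where each child Beat already gets its own unique data path
// and the lock only adds failure modes like the ones in this issue.
func LockDataPath(dataPath string, managedByAgent bool) (unlock func(), err error) {
	if managedByAgent {
		return func() {}, nil // no lock file is written or checked at all
	}
	lockPath := filepath.Join(dataPath, "beat.lock") // hypothetical file name
	f, err := os.OpenFile(lockPath, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o600)
	if err != nil {
		return nil, err // another Beat appears to own this data path
	}
	f.Close()
	return func() { os.Remove(lockPath) }, nil
}
```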
Closing, we have removed use of these lock files when Beats are run under agent in 8.7.
Closing, we have removed use of these lock files when Beats are run under agent in 8.7.
@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?
Closing, we have removed use of these lock files when Beats are run under agent in 8.7.
@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?
same question @cmacknz, its affecting about 400 of our devices on 8.5
Manually deleting the lock files before Filebeat starts should be the workaround here. Wrapping agent or filebeat in a script that unconditionally deletes them is one way; they aren't actually necessary.
@cmacknz is there a PowerShell command you could provide to do this workaround?