beats
beats copied to clipboard
add pid awareness to file locking
What does this PR do?
This PR seeks to add pid-awareness to the lockfile system that's initialized when a beat starts up. When beats attempts to create the lockfile, it checks for a pre-existing file, and if it exists, it checks to see if the process listed in the file exists or not. If the process does not exist, (in the case of a bad shutdown where the lockfile never got removed) it will attempt to recover, and create a new lockfile.
Why is it important?
In automated agent systems, the beat can sometimes take to long to shutdown, causing agent to hard-kill the process. In this instance, the beat won't restart, as the lockfile is still hanging around. This fixes that.
- Closes #31670
Checklist
- [ ] Test while running under agent
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added an entry in
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.
:green_heart: Build Succeeded
the below badges are clickable and redirect to their specific view in the CI or DOCS
![]()
![]()
![]()
![]()
![]()
Expand to view the summary
Build stats
-
Start Time: 2022-10-11T23:10:15.842+0000
-
Duration: 132 min 34 sec
Test stats :test_tube:
Test | Results |
---|---|
Failed | 0 |
Passed | 23656 |
Skipped | 1950 |
Total | 25606 |
:green_heart: Flaky test report
Tests succeeded.
:robot: GitHub comments
Expand to view the GitHub comments
To re-run your PR in the CI, just comment with:
-
/test
: Re-trigger the build. -
/package
: Generate the packages and run the E2E tests. -
/beats-tester
: Run the installation tests with beats-tester. -
run
elasticsearch-ci/docs
: Re-trigger the docs validation. (use unformatted text in the comment!)
@cmacknz
Were you able to reproduce the problem? Did we test this manually using the entire beat process in addition to the unit tests?
Yup, tested with metricbeat under the usual cases (remaining lockfile for dead process, two beats trying to start at once)
Okay, so, only issue that isn't dealt with is how to deal with zombie beats processes. It's definitely a problem, and I'm currently trying to find the most elegant way to handle this cross-platform.
This PR now depends on: https://github.com/elastic/elastic-agent-system-metrics/pull/53
Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least.
Alright, so the typecheck errors in CI are not related, but kinda baffling. I assume the issue (based on what I've hit before) is the fact that the github CI thing has cgo disabled, and the build constraints in elastic-agent-system-metrics require darwin&cgo:
//go:build (darwin && cgo) || freebsd || linux || windows || aix
// +build darwin,cgo freebsd linux windows aix
We'll probably need a separate fix for this, as I assume cgo is disabled for a reason.
alright, quick PR here to hopefully help with CI: https://github.com/elastic/elastic-agent-system-metrics/pull/55
Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least.
Curious how you trigger zombie processes. Were you using kill -STOP <pid>
?
Curious how you trigger zombie processes.
@joshdover You just spawn a child process, then have the child die without the parent cleaning up the handle via waitpid()
Easier in C, where there's no runtime to manage things for you.
Will CI be happy now? Let's see.
Alright, still running into issues with the system-metrics builds. I forgot to add a build target for bsd, which explains a few of them, but I'm still not sure why the builds are failing on darwin:
[2022-09-29T20:14:45.101Z] --> darwin/amd64 error: exit status 2
[2022-09-29T20:14:45.102Z] Stderr: # github.com/elastic/elastic-agent-system-metrics/metric/system/process
[2022-09-29T20:14:45.102Z] ../../../../../pkg/mod/github.com/elastic/elastic-agent-system-metrics@v0.4.5-0.20220929185205-8b2e387cd113/metric/system/process/process_common.go:220:24: init.FetchPids undefined (type Stats has no field or method FetchPids)
[2022-09-29T20:14:45.102Z] ../../../../../pkg/mod/github.com/elastic/elastic-agent-system-metrics@v0.4.5-0.20220929185205-8b2e387cd113/metric/system/process/process_common.go:241:20: undefined: GetInfoForPid
My understanding is that the CI environment does have cgo enabled, and the other files that import this library don't have a build constraint on darwin&cgo
, which is...weird.
Going to dig a little further. @cmacknz the jenkins environment has cgo enabled, right? Also, for the future, it might be good to import some tests into elastic-agent-system-metrics similar to the libbeat-crosscompile
ones here, so I don't shoot myself in the foot quite as easily.
The actual solution here is to make the typecheck
github CI rule work with cgo, but I have no idea what's involved in that.
Realizing that there's nothing else in libbeat that has a buildtime dependency on cgo, which is...probably why this is so awkward.
This actually builds fine with darwin using the normal crossCompile. The issue seems to be with whatever is happening with gox
during the CI target.
Alright, I have an admittedly ugly hack for the cgo issues in libbeat's CI. In the mean time, a lot of the windows tests are failing, not sure why.
Current E2E error seems unrelated, as usual:
unable to find package Endpoint and Cloud Security
So, after talking with @leehinman , I decided that the use of shared locks was a tad confusing, particularly on windows. I changed the code to use exclusive locks, and moved around some related code.
Currently tested on Darwin, testing on Linux too.
Edit: tested on Linux, fine
So, made a few changes based on @andrewkroh 's feedback.
Two things to address still:
- As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.
- My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.
Ah, E2E tests are again being weird:
fatal: [18.218.130.4]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}
- As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.
Yeah this makes sense. We should try testing this shortly after getting this in. I think this change is still valuable since we do not know that the queue flush is the only reason the beat is failing to clean this up.
- My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.
IMO we could solve this as a follow up after getting the PR in.
IMO we could solve this as a follow up after getting the PR in.
Agreed, this has been in a PR long enough. Gonna give the E2E tests one last chance to work, then merge.
As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.
As far as I can tell the Beats don't actually wait for the queue to empty by default (and this is what the documentation says as well). At least the ones I looked at do not obviously wait for the queue to drain before shutting down, since ShutdownTimeout
defaults to 0
.
https://github.com/elastic/beats/blob/27807aa9db397528640066a028bcd732b4a56eab/filebeat/beater/filebeat.go#L409-L412
https://github.com/elastic/beats/blob/27807aa9db397528640066a028bcd732b4a56eab/winlogbeat/beater/winlogbeat.go#L170-L176
I think at this point in the 8.5 release cycle with only a few days before the release we should target 8.5.1 instead of 8.5.0 with this. This change needs some soak time before release considering the complexity and the chance to create some new unintended failure mode where the beats refuse to start.
This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.
This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.
Agreed, this is kind of a scary change, would prefer not to backport anything last-minute.
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
-
backport-v8./d.0
is the label to automatically backport to the8./d
branch./d
is the digit
Can we confirm if this will only make to 8.6.0? From what I see it will not be backported to 8.5
@lucabelluccini i'm pretty sure this goes into 8.5.1. @jlind23 keep me honest
Yes, I had planned to put this in 8.5.1. We are waiting for the 8.5.0 release to occur before back porting, as we are still generating 8.5.0 build candidates due to blockers.
@fearful-symmetry I've added the label to create the 8.5.0 backport, we are good to backport for inclusion in 8.5.1 now.