beats icon indicating copy to clipboard operation
beats copied to clipboard

add pid awareness to file locking

Open fearful-symmetry opened this issue 2 years ago • 17 comments

What does this PR do?

This PR seeks to add pid-awareness to the lockfile system that's initialized when a beat starts up. When beats attempts to create the lockfile, it checks for a pre-existing file, and if it exists, it checks to see if the process listed in the file exists or not. If the process does not exist, (in the case of a bad shutdown where the lockfile never got removed) it will attempt to recover, and create a new lockfile.

Why is it important?

In automated agent systems, the beat can sometimes take to long to shutdown, causing agent to hard-kill the process. In this instance, the beat won't restart, as the lockfile is still hanging around. This fixes that.

  • Closes #31670

Checklist

  • [ ] Test while running under agent
  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

fearful-symmetry avatar Sep 22 '22 23:09 fearful-symmetry

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-10-11T23:10:15.842+0000

  • Duration: 132 min 34 sec

Test stats :test_tube:

Test Results
Failed 0
Passed 23656
Skipped 1950
Total 25606

:green_heart: Flaky test report

Tests succeeded.

:robot: GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine avatar Sep 23 '22 05:09 elasticmachine

@cmacknz

Were you able to reproduce the problem? Did we test this manually using the entire beat process in addition to the unit tests?

Yup, tested with metricbeat under the usual cases (remaining lockfile for dead process, two beats trying to start at once)

fearful-symmetry avatar Sep 23 '22 19:09 fearful-symmetry

Okay, so, only issue that isn't dealt with is how to deal with zombie beats processes. It's definitely a problem, and I'm currently trying to find the most elegant way to handle this cross-platform.

fearful-symmetry avatar Sep 23 '22 21:09 fearful-symmetry

This PR now depends on: https://github.com/elastic/elastic-agent-system-metrics/pull/53

fearful-symmetry avatar Sep 26 '22 21:09 fearful-symmetry

Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least.

fearful-symmetry avatar Sep 28 '22 19:09 fearful-symmetry

Alright, so the typecheck errors in CI are not related, but kinda baffling. I assume the issue (based on what I've hit before) is the fact that the github CI thing has cgo disabled, and the build constraints in elastic-agent-system-metrics require darwin&cgo:

//go:build (darwin && cgo) || freebsd || linux || windows || aix
// +build darwin,cgo freebsd linux windows aix

We'll probably need a separate fix for this, as I assume cgo is disabled for a reason.

fearful-symmetry avatar Sep 28 '22 20:09 fearful-symmetry

alright, quick PR here to hopefully help with CI: https://github.com/elastic/elastic-agent-system-metrics/pull/55

fearful-symmetry avatar Sep 28 '22 22:09 fearful-symmetry

Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least.

Curious how you trigger zombie processes. Were you using kill -STOP <pid>?

joshdover avatar Sep 29 '22 13:09 joshdover

Curious how you trigger zombie processes.

@joshdover You just spawn a child process, then have the child die without the parent cleaning up the handle via waitpid() Easier in C, where there's no runtime to manage things for you.

fearful-symmetry avatar Sep 29 '22 18:09 fearful-symmetry

Will CI be happy now? Let's see.

fearful-symmetry avatar Sep 29 '22 19:09 fearful-symmetry

Alright, still running into issues with the system-metrics builds. I forgot to add a build target for bsd, which explains a few of them, but I'm still not sure why the builds are failing on darwin:

[2022-09-29T20:14:45.101Z] --> darwin/amd64 error: exit status 2

[2022-09-29T20:14:45.102Z] Stderr: # github.com/elastic/elastic-agent-system-metrics/metric/system/process

[2022-09-29T20:14:45.102Z] ../../../../../pkg/mod/github.com/elastic/elastic-agent-system-metrics@v0.4.5-0.20220929185205-8b2e387cd113/metric/system/process/process_common.go:220:24: init.FetchPids undefined (type Stats has no field or method FetchPids)

[2022-09-29T20:14:45.102Z] ../../../../../pkg/mod/github.com/elastic/elastic-agent-system-metrics@v0.4.5-0.20220929185205-8b2e387cd113/metric/system/process/process_common.go:241:20: undefined: GetInfoForPid

My understanding is that the CI environment does have cgo enabled, and the other files that import this library don't have a build constraint on darwin&cgo, which is...weird.

Going to dig a little further. @cmacknz the jenkins environment has cgo enabled, right? Also, for the future, it might be good to import some tests into elastic-agent-system-metrics similar to the libbeat-crosscompile ones here, so I don't shoot myself in the foot quite as easily.

fearful-symmetry avatar Sep 29 '22 20:09 fearful-symmetry

The actual solution here is to make the typecheck github CI rule work with cgo, but I have no idea what's involved in that.

fearful-symmetry avatar Sep 29 '22 20:09 fearful-symmetry

Realizing that there's nothing else in libbeat that has a buildtime dependency on cgo, which is...probably why this is so awkward.

fearful-symmetry avatar Sep 29 '22 23:09 fearful-symmetry

This actually builds fine with darwin using the normal crossCompile. The issue seems to be with whatever is happening with gox during the CI target.

fearful-symmetry avatar Sep 29 '22 23:09 fearful-symmetry

Alright, I have an admittedly ugly hack for the cgo issues in libbeat's CI. In the mean time, a lot of the windows tests are failing, not sure why.

fearful-symmetry avatar Sep 30 '22 00:09 fearful-symmetry

Current E2E error seems unrelated, as usual:

unable to find package Endpoint and Cloud Security

fearful-symmetry avatar Oct 04 '22 04:10 fearful-symmetry

So, after talking with @leehinman , I decided that the use of shared locks was a tad confusing, particularly on windows. I changed the code to use exclusive locks, and moved around some related code.

Currently tested on Darwin, testing on Linux too.

Edit: tested on Linux, fine

fearful-symmetry avatar Oct 04 '22 20:10 fearful-symmetry

So, made a few changes based on @andrewkroh 's feedback.

Two things to address still:

  • As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.
  • My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.

fearful-symmetry avatar Oct 10 '22 18:10 fearful-symmetry

Ah, E2E tests are again being weird:

fatal: [18.218.130.4]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}

fearful-symmetry avatar Oct 10 '22 22:10 fearful-symmetry

  • As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.

Yeah this makes sense. We should try testing this shortly after getting this in. I think this change is still valuable since we do not know that the queue flush is the only reason the beat is failing to clean this up.

  • My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.

IMO we could solve this as a follow up after getting the PR in.

joshdover avatar Oct 11 '22 09:10 joshdover

IMO we could solve this as a follow up after getting the PR in.

Agreed, this has been in a PR long enough. Gonna give the E2E tests one last chance to work, then merge.

fearful-symmetry avatar Oct 11 '22 18:10 fearful-symmetry

As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.

As far as I can tell the Beats don't actually wait for the queue to empty by default (and this is what the documentation says as well). At least the ones I looked at do not obviously wait for the queue to drain before shutting down, since ShutdownTimeout defaults to 0.

https://github.com/elastic/beats/blob/27807aa9db397528640066a028bcd732b4a56eab/filebeat/beater/filebeat.go#L409-L412

https://github.com/elastic/beats/blob/27807aa9db397528640066a028bcd732b4a56eab/winlogbeat/beater/winlogbeat.go#L170-L176

cmacknz avatar Oct 11 '22 19:10 cmacknz

I think at this point in the 8.5 release cycle with only a few days before the release we should target 8.5.1 instead of 8.5.0 with this. This change needs some soak time before release considering the complexity and the chance to create some new unintended failure mode where the beats refuse to start.

This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.

cmacknz avatar Oct 11 '22 19:10 cmacknz

This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.

Agreed, this is kind of a scary change, would prefer not to backport anything last-minute.

fearful-symmetry avatar Oct 11 '22 19:10 fearful-symmetry

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

mergify[bot] avatar Oct 11 '22 19:10 mergify[bot]

Can we confirm if this will only make to 8.6.0? From what I see it will not be backported to 8.5

lucabelluccini avatar Oct 21 '22 17:10 lucabelluccini

@lucabelluccini i'm pretty sure this goes into 8.5.1. @jlind23 keep me honest

amitkanfer avatar Oct 22 '22 17:10 amitkanfer

Yes, I had planned to put this in 8.5.1. We are waiting for the 8.5.0 release to occur before back porting, as we are still generating 8.5.0 build candidates due to blockers.

cmacknz avatar Oct 24 '22 17:10 cmacknz

@fearful-symmetry I've added the label to create the 8.5.0 backport, we are good to backport for inclusion in 8.5.1 now.

cmacknz avatar Oct 31 '22 19:10 cmacknz