yabeda-puma-plugin icon indicating copy to clipboard operation
yabeda-puma-plugin copied to clipboard

Support phased restarts with standalone exporter

Open botimer opened this issue 2 years ago • 4 comments

Addresses #22

This change adds some additional tests to cover clustered mode with the standalone exporter. The phased restart example fails without the patch to the exporter, which modifies it to handle the events differently:

  • The metrics server variable is moved outside of the on_booted handler so it is not created separately in the forked workers
  • The on_booted handler safely checks whether the server is running before attempting to start
  • The on_stopped and on_restart events are caught and issue a synchronous stop, rather than the previous state hook, which does not signal restart

The construction is a little odd, with the start_server lambda assigning the metrics server instance, but I found the behavior quite fickle when trying to reuse an instance. This was the most concise solution I could come up with.

botimer avatar Feb 10 '22 16:02 botimer

Something I noticed locally was some inconsistency when running the tests under Docker. I think it has to do with the stream handling of docker-compose and IO.popen. Sometimes, the tests would run straight through, but sometimes they would hang, and control-C would nudge them along. I tried a bunch of variations with Process.spawn, fork, Thread.new, and Open3, but this was the most consistently behaved I could get it.

It looks like the CI tests are hanging in the same way. I'm happy to take amendments on how to get this working under Docker; that's obviously important to overall health. I'm just out of ideas on what to try.

botimer avatar Feb 10 '22 16:02 botimer

I've found why the tests are hanging under Docker, though I'm not sure of the best approach to resolving it. Using IO.popen, Process.spawn, etc., the command is run with sh -c under Docker. This means that the puma process that can respond to the USR1, USR2, and INT signals is not returned as the pid.

https://stackoverflow.com/questions/45429004/cant-send-signals-to-process-created-by-pty-spawn-in-ruby

The two equally unsavory options (because it adds weird conditionality between regular hosts and containers) appear to be using pgrep or a pid file. Is there any advice or preference between these?

botimer avatar Feb 16 '22 18:02 botimer

Wow. Thanks a lot for the pull request!

Can you please tell a bit more why you're using standalone process for exporter? Also some quick note to README about that would be awesome.

I will be able to take a thorough look to this pull request in a week or two, don't feel lost ;-)

Envek avatar Feb 16 '22 20:02 Envek

Hello! Sure... We are using the standalone exporter to take metrics out of the main puma thread pool and listen on a separate port dedicated to Prometheus metrics with firewall rules, rather than using path-based restriction or fixed token authorization. The first point is more important for one of our applications, which occasionally has long running requests that fill the thread pool. When that happens, the metrics are not available, because those requests are blocked behind regular app requests. We would like to use the continuous metrics to better understand exactly where the issues are without aggravating them.

We are also using clustered mode and looking to use phased restarts for periodic recycling of the workers without interruption to the others, or the whole app. This is because we must occasionally use significant amounts of ram, which the Ruby garbage collector never releases. If left running for long periods, the workers can balloon, but with periodic restarts, we can preserve considerable total headroom. Admittedly, it is a workaround of sorts, but we decided to use something "built in" for now, rather than moving this up to higher level orchestration. The crash from trying to rebind the port is the only apparent hurdle to gaining some steadier behavior and better insight into the activity when the app becomes troubled.

As far as the README, I'm not sure that an addition is needed. The "On different port" section describes how to set up the standalone mode and why you might use it. Nothing there should change; there are no extra requirements or config changes for phased restarts. They just don't work with standalone mode right now.

botimer avatar Feb 16 '22 22:02 botimer

Hey @botimer. Sorry for the late reply (2022 is a little bit tough year).

I refreshed your pull request in phased-restart branch in this yabeda repo, feel free to pull it into your pull requests (it seems that you have unchecked “Allow edits from maintainers” checkbox so I can't push it right into your pull request).

Tests are still hanging in GitHub Actions, haven't chance to look into it (@skryukov, would you like to take a look?)

Envek avatar Oct 21 '22 14:10 Envek

Thanks a lot for your contribution! Released in 0.7.0

Envek avatar Oct 24 '22 10:10 Envek

Thanks for the update and release. It looks like you were able to do what you needed and the PR was tracked as merged, so I will delete the originating branch.

botimer avatar Oct 24 '22 14:10 botimer