
Implemented health probe for cleaner shutdown

Open eugene-uniswap opened this issue 6 months ago • 10 comments

Problem

When op-geth is being restarted, we need to signal the load balancer to stop sending requests to a node before shutting it down; otherwise we will see errors in the RPC. The proper way to signal the load balancer is to implement a server-side readiness probe endpoint that starts failing when the node receives SIGTERM, allow the load balancer to stop sending traffic to the node, and then, after some reasonable timeout, drain in-flight requests and exit.

This PR implements functionality to:

  • fail the /readyz health check when node shutdown is initiated;
  • allow 7 seconds for existing requests to drain before the RPC servers are shut down and the database is closed (a rough sketch of the pattern follows this list).
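
For illustration only, here is a minimal sketch of the pattern in Go (hypothetical and heavily simplified, not the actual PR code; the port and the 7-second window mirror the test setup below):

```go
// Hypothetical, simplified sketch of the pattern; not the actual op-geth code.
// On SIGINT/SIGTERM the /readyz handler starts returning 503, the process
// keeps serving other traffic for a drain window, and only then stops the
// HTTP server.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool // flipped when a termination signal arrives

	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: "127.0.0.1:8551", Handler: mux}
	go srv.ListenAndServe()

	// Block until a shutdown signal arrives.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	// Fail the readiness probe, then give the load balancer time to react
	// before the listener disappears (7s in the original PR, opt-in via
	// --shutdown-delay in the updated version).
	shuttingDown.Store(true)
	time.Sleep(7 * time.Second)

	// Drain in-flight requests and stop the server.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```

The key point is that /readyz flips to 503 before the listener goes away, so a polling load balancer gets a full drain window to stop routing traffic to the node.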

Tests

Testing was done on a local workstation by launching op-geth, hitting /readyz on the RPC port with fortio, interrupting op-geth with Ctrl-C, and watching the /readyz probe fail while op-geth waits 7 seconds before shutting down the server.

fortio load -c 1 -qps 1 -allow-initial-errors -n 10000 http://127.0.0.1:8551/readyz

17:50:26.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:27.384 r1 [WRN] http_client.go:1217> Content-length missing, header_len=64, thread=0, run=0
17:50:28.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:29.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:30.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:31.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:32.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:33.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:34.384 r1 [WRN] http_client.go:1151> Non ok http code, code=503, status="HTTP/1.1 503", thread=0, run=0
17:50:35.383 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0
17:50:36.385 r1 [ERR] http_client.go:954> Unable to connect, dest={"IP":"127.0.0.1","Port":8551,"Zone":""}, err="dial tcp 127.0.0.1:8551: connect: connection refused", numfd=6, thread=0, run=0

op-geth output:

^CINFO [04-22|17:50:28.036] Got interrupt, shutting down...
INFO [04-22|17:50:35.037] HTTP server stopped                      endpoint=127.0.0.1:8551
INFO [04-22|17:50:35.038] IPC endpoint closed                      url=/Users/eugene.aleynikov/Library/Ethereum/geth.ipc
INFO [04-22|17:50:35.155] Ethereum protocol stopped
INFO [04-22|17:50:35.155] Transaction pool stopped
INFO [04-22|17:50:35.204] Persisting dirty state to disk           root=d7f897..0f0544 layers=0
INFO [04-22|17:50:35.205] Persisted dirty state to disk            size=74.00B elapsed="264.834µs"
INFO [04-22|17:50:35.236] Blockchain stopped

Additional context

  • This PR is needed for the work to add a backend health check to proxyd.
  • When launching op-geth via Docker Compose, the default stop_grace_period is 10 seconds. This should be increased to at least 15 seconds—or ideally 30 seconds—to align with the default values used by Kubernetes and ECS. Otherwise, the container risks being forcefully terminated with a SIGKILL before the drain logic has a chance to complete, which will cause unclean shutdown and possible on-disk data corruption.
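
For example (hypothetical Compose fragment; the service name is a placeholder for your op-geth service), the grace period can be raised like this:

```yaml
services:
  op-geth:
    stop_grace_period: 30s  # Compose default is 10s; allow for the drain delay plus Geth's normal shutdown work
```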

UPDATE: more testing after making the shutdown delay optional.

  1. Specify a shutdown delay of 10 seconds and observe the shutdown process (pay attention to the timestamps): build/bin/geth --shutdown-delay 10s
^CINFO [06-12|12:57:39.058] Got interrupt, shutting down...
INFO [06-12|12:57:49.059] HTTP server stopped                      endpoint=127.0.0.1:8551
  2. Without the --shutdown-delay option:
^CINFO [06-12|13:01:36.265] Got interrupt, shutting down...
INFO [06-12|13:01:36.265] HTTP server stopped                      endpoint=127.0.0.1:8551

eugene-uniswap · May 21 '25

It doesn't appear that Geth implements these health check endpoints (?). Even if it did, it seems a bit excessive to add a blanket 7 second delay during shutdown.

Here's the logic behind it:

We have implemented a health check in proxyd (it will be committed upstream later), but it does not make sense to fire it at intervals shorter than 5 seconds. Running a longer interval also does not make sense, because it increases reaction time and slows down deployment. So 7 seconds is a nearly perfect value that will be good enough for everyone, and everyone needs a health check if they are using any kind of load balancer.
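
To make the reaction-time argument concrete, a generic poller (purely illustrative, not proxyd's actual code) might look like the sketch below; with a 5-second interval, up to one full interval can pass between the node starting to return 503 and the balancer pulling it out of rotation, which is why roughly 7 seconds of draining covers the worst case:

```go
// Illustrative health-check poller; not proxyd's actual implementation.
// It shows why the polling interval bounds the reaction time: in the worst
// case the backend starts failing /readyz right after a poll, so the
// balancer only notices roughly one interval later.
package main

import (
	"log"
	"net/http"
	"time"
)

func pollReadyz(url string, interval time.Duration, healthy chan<- bool) {
	for range time.Tick(interval) {
		resp, err := http.Get(url)
		ok := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		healthy <- ok
	}
}

func main() {
	healthy := make(chan bool)
	go pollReadyz("http://127.0.0.1:8551/readyz", 5*time.Second, healthy)
	for ok := range healthy {
		// A real balancer would add or remove the backend from rotation here.
		log.Printf("backend healthy=%v", ok)
	}
}
```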

Also, if people use k8s, there is no guarantee that SIGTERM will arrive after kube-proxy removes the node from the service; there is a race condition there. If you decide to run Istio, it can typically add another 2-3 seconds of configuration replication lag on top of that. So k8s users will also benefit from this delay.
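
For reference, the usual mitigation on the k8s side is a preStop sleep plus a sufficient grace period (illustrative fragment; the container name, image, and durations are placeholders):

```yaml
spec:
  terminationGracePeriodSeconds: 30   # must cover the preStop sleep, the drain window, and Geth's own shutdown
  containers:
    - name: op-geth
      image: op-geth:latest           # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "5"]   # give kube-proxy/Istio time to drop the endpoint before SIGTERM is delivered
```

This addresses the same endpoint-propagation race that the in-process delay targets.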

eugene-uniswap · May 21 '25

We don't want to add the timeout. If you need to coordinate shutdown with your load balancer, then your cluster management needs to first take a Geth instance out of the load balancer and then send the shutdown signal to it. Any solution implemented in Geth will always be racy and weird.

fjl · May 27 '25

The primary purpose of adding the health check was to signal proxyd. Since proxyd doesn't offer an API to remove a node from rotation, it currently relies on tracking error rates, meaning a significant number of errors can occur before it reacts. We've implemented a health check to address this, but it's a polling mechanism, with a reaction delay during which traffic must still be served. This feature is essential, and once it's fully tested, we plan to merge the health check code upstream into proxyd.

So is this PR a no-go, or are there conditions under which it could happen? If we make the delay optional via a command-line switch, will that work?

eugene-uniswap · May 28 '25

We don't want to add it the way it is now in the PR. It could work with a flag, like --shutdown-delay, but I still think it's weird. The only way to shut down Geth is sending SIGINT to it. It already takes a significant amount of time after the signal to shut down, because Geth has to write the current state to disk and finish any long-running processing cleanly. So if we add a delay, it has to be done in a way that doesn't lengthen that time further. Most process supervisors have a setting that kills the process if it takes too long to shut down, and this is already a problem. I would rather make Geth exit instantly than increase the delay.

fjl · May 29 '25

k8s has terminationGracePeriodSeconds, which is 30 seconds by default and can be increased without a problem. ECS also has a 30-second grace period. 7 seconds is not that much on a 30-second scale. We only need to care about Docker Compose, which has a 10-second default grace timeout. @fjl If I implement --shutdown-delay and make the delay opt-in, is there a chance this PR gets merged?

eugene-uniswap · May 29 '25

It is not so easy. The flag is a good step, and we will definitely not merge it without the flag. But I'm also trying to explore alternative solutions. For example, we might want to add an admin_ method for shutting down the node gracefully. I left the PR open to discuss possibilities here. We will triage this again next week as a team.

fjl · May 29 '25

I'm concerned that this might be difficult to align with the container lifecycle. For example, when you update a container version, you typically revise the ECS task definition. ECS then begins the update process by sending a SIGTERM signal to one of the running containers (we don't know which one will be chosen first). At that point, you have two options:

  1. Stop accepting requests immediately, which can cause a spike in errors at the downstream load balancer. (current scenario)

  2. Signal the load balancer to stop sending new traffic first, while continuing to serve existing requests until the load balancer reacts to a health check.

If the load balancer is integrated into the platform (like ALB or kube-proxy), the first solution is generally manageable. However, when using something like proxyd, which is not platform-native, there’s no built-in mechanism to signal it—aside from the health check we're trying to implement.

There’s a reason SIGTERM and SIGKILL are spaced out by 30 seconds (by default): that window is intended to give time to resolve race conditions and ensure a clean shutdown, where the client stops sending requests first and the server stops accepting them only after that.

eugene-uniswap · May 29 '25

Yes, we will definitely consider merging this when it is behind a flag! But the flag should not be on by default.

fjl · Jun 03 '25

@fjl Added a commit making the delay optional and off by default. Did some testing (see the PR description).

eugene-uniswap · Jun 09 '25

@fjl Hi Felix! What is the final conclusion? Thanks!

eugene-uniswap · Jun 23 '25