grpc-health-probe
grpc-health-probe shuts down the gRPC connection non-gracefully
The probe process sends a TCP RST at the conclusion of the gRPC session instead of gracefully shutting down with a FIN/ACK exchange. This causes a lot of log noise in k8s pods, such as:
Feb 01, 2020 2:54:05 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated INFO: Transport failed io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
Screenshot of the Wireshark packet exchange (image omitted).
This came up in #33 as well. However, it does not seem to surface in gRPC implementations other than Java, so I might need some help debugging this.
The code is rather short; maybe take a look to see if we're missing a conn.Close(). I'll take a look as well.
Welp, it seems like we do a bunch of os.Exit(..), which means defer conn.Close() is never given a chance to execute.
Should be a rather easy thing to fix (roughly as sketched below). Would you be able to compile a binary and test it if I give you a patch?
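For illustration, here is a minimal sketch of the problem and of the usual way to restructure it so the defer still runs; the address, timeout, and overall structure are placeholders, not the probe's actual code:

// Minimal sketch: os.Exit terminates the process immediately and does NOT run
// deferred functions, so conn.Close() is skipped. Keeping the logic in a helper
// that returns an exit code lets its defers run before os.Exit is called.
package main

import (
    "context"
    "log"
    "os"
    "time"

    "google.golang.org/grpc"
)

// run holds the probe-like logic; its defers execute when it returns.
func run() int {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "localhost:50052", grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Println("failed to dial:", err)
        return 2
    }
    defer conn.Close() // runs on return, unlike code paths that call os.Exit directly

    // ... perform the health-check RPC here ...
    return 0
}

func main() {
    os.Exit(run()) // only called after run() has returned and its defers have executed
}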
See #36, can you compile and let me know if that helps?
Hi Ahmet, thanks a lot for your quick response! I certainly can, but I need to head out in a minute. I will do it first thing in the morning (PST), is that OK?
I just built the updated code and tested the binary, but unfortunately I am still getting the same errors.
Can you please try this: change the defer conn.Close() to

defer func() {
    log.Println("closing connection")
    conn.Close()
    log.Println("closed connection")
}()

and recompile & run. We should be seeing these clear messages on the exit path of the probe.
The messages are printed out but the issue remains. I passed this on to one of my colleagues who knows Go to see if he can spot anything.
That's indeed peculiar. Is the Wireshark dump the same, i.e. are you still seeing a RST?
My coworker made a few changes as an experiment and it still sends a RST, same as before. I am as confused as you are; typically sockets are closed gracefully by the OS when a process dies. I don't really have any more cycles to experiment further, unfortunately, but please feel free to update if you make a breakthrough.
My idea would be adding time.Sleep(time.Second*2) before os.Exit() to see if we can keep the process alive long enough for the connection to close. But since you applied the code block above and saw the "closed connection" message in the output, honestly I have no other answer.
One good litmus test would be to see if this happens to all Go gRPC clients calling Java gRPC servers. (I assume not, but it might help bisect where the problem is.) For that, you can maybe write a small program that makes an RPC and exits.
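If it helps, something along these lines could serve as that small test program (a minimal sketch; the address and timeout are placeholders for the Java server under test):

// Standalone Go client: dial, make one health-check RPC, close, exit.
// Point it at the Java gRPC server and capture the connection teardown in Wireshark.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "localhost:50051", grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Fatalf("failed to dial: %v", err)
    }
    defer conn.Close() // runs when main returns normally

    resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
    if err != nil {
        log.Fatalf("health check failed: %v", err)
    }
    log.Printf("status: %v", resp.GetStatus())
}

If this client also produces the "Connection reset by peer" log on the Java side, the problem is likely in grpc-go's connection teardown rather than in the probe itself.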
I went ahead and merged #36 to master, so you can build off of that.
NB: Issue is still not fixed by #36.
For #36, can we send SIGTERM instead of os.Exit(retCode)? Would that solve the issue?
I don’t see why it would help.
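For what it's worth, a quick experiment suggests why (a hypothetical, Unix-only sketch, not a proposed change): with the default signal disposition, a self-delivered SIGTERM also terminates the process without running deferred functions, so conn.Close() would be skipped just as it is with os.Exit.

// Hypothetical experiment: deliver SIGTERM to ourselves instead of calling os.Exit.
// With no handler installed via os/signal, the process still exits without running
// its deferred functions.
package main

import (
    "fmt"
    "syscall"
    "time"
)

func main() {
    defer fmt.Println("deferred cleanup") // never printed: defers don't run on signal-driven exit
    syscall.Kill(syscall.Getpid(), syscall.SIGTERM)
    time.Sleep(time.Second) // give the signal time to be delivered; the process dies here
}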
Is this being worked on for the next release? =)
No. We can't repro the issue in other languages, and furthermore we don't know what's causing it. If you have ideas or time to debug, please contribute.
I think I'm seeing something (perhaps) similar to this with a Go service?
With this health service:
package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    "google.golang.org/grpc/health/grpc_health_v1"
    "google.golang.org/grpc/reflection"
)

func main() {
    lis, err := net.Listen("tcp", "localhost:50052")
    if err != nil {
        log.Fatal("failed to listen")
    }

    h := health.NewServer()
    // Docs say that an empty service name is used to indicate "overall server health",
    // which is the only thing we care about. We only have one service. Refer to
    // https://github.com/grpc/grpc/blob/master/doc/health-checking.md
    h.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

    s := grpc.NewServer()
    grpc_health_v1.RegisterHealthServer(s, h)
    // we're happy to do this because we can use grpcurl without having to provide a .proto file
    reflection.Register(s)

    log.Print("Starting healthcheck")
    if err := s.Serve(lis); err != nil {
        log.Fatal("failed to serve healthcheck")
    }
}
Running:
GRPC_GO_LOG_VERBOSITY_LEVEL=2 GRPC_GO_LOG_SEVERITY_LEVEL=info go run main.go
And triggering the health probe in another terminal:
grpc-health-probe -addr localhost:50052
I'm seeing this WARNING in the service logs:
WARNING: 2020/04/21 14:02:57 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50052->127.0.0.1:53958: read: connection reset by peer
INFO: 2020/04/21 14:02:57 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
I wasn't sure if this was intended behaviour, so I changed the defer conn.Close() in grpc-health-probe/main.go to:

defer func() {
    time.Sleep(time.Second)
    conn.Close()
}()
And when running that against the service above, I only see:
INFO: 2020/04/21 14:03:20 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
in the logs.
I did a bit more munging and found that inserting time.Sleep(time.Millisecond) between the Check call returning and conn.Close() being called (I put it on line 234) seems to prevent this warning as well.
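For reference, here is roughly where that workaround sits relative to the RPC, shown as a standalone sketch rather than the probe's actual main.go (the address, timeout, and structure are placeholders):

// Sketch of the workaround: give the transport a moment after the Check RPC
// returns before tearing the connection down.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    conn, err := grpc.DialContext(ctx, "localhost:50052", grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        log.Fatalf("failed to dial: %v", err)
    }

    if _, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
        log.Fatalf("health check failed: %v", err)
    }

    time.Sleep(time.Millisecond) // the workaround: let the transport flush before closing
    conn.Close()
}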
Any advice on how we should proceed here? I'm not in favor of adding notable sleeps here and there, since I expect defer conn.Close() is actually supposed to close the connection.
This is a known issue already reported: https://github.com/grpc-ecosystem/grpc-health-probe/issues/34
We spent some time researching it to no avail as you can see in the thread.
I have long been thinking about writing a small C client (to be closer to the metal and actually have control) to replace the Go implementation, but never seem to get the time.
I can testify to seeing the same error in a Go service like https://github.com/grpc-ecosystem/grpc-health-probe/issues/34#issuecomment-616921552.
I suspect this issue is caused by https://github.com/grpc/grpc-go/issues/460
We just hit this problem while running the prober on our GKE services. We were observing a slow leak of bpf_jit memory allocations on the node, counted with:
jq -n "`sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{ print $2 }' | jq -s 'add'` / `sudo cat /proc/sys/net/core/bpf_jit_limit`"
Within weeks it exceeds the bpf_jit_limit, and the node ends up in a state where starting a new container fails with:
OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: setns_init_linux.go:88: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: errno 524: unknown
The node needs a restart to recover.
Workarounds
We have observed this on 3 independent services. Using the GKE built-in gRPC prober solves the issue. Adding a sleep of 100ms (1ms is not enough in our case), as suggested in https://github.com/grpc-ecosystem/grpc-health-probe/issues/34#issuecomment-616921552, also works.
Proper fix
Is there any idea for a proper fix that doesn't involve a sleep? (Or maybe https://github.com/grpc/grpc-go/issues/460 needs more prioritization, in case it's really the cause.)
As a side note: do you think it's worth removing the issue's "harmless" tag? As our case shows, in certain circumstances this issue can cause real trouble.