
grpc-health-probe shuts down the gRPC connection non-gracefully

Open JanDash opened this issue 5 years ago • 21 comments

The probe process sends a TCP RST at the conclusion of the gRPC session instead of shutting down gracefully with a FIN/ACK exchange. This causes a lot of log noise in k8s pods, such as:

Feb 01, 2020 2:54:05 PM io.grpc.netty.shaded.io.grpc.netty.NettyServerTransport notifyTerminated INFO: Transport failed io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

Screenshot of the Wireshark packet exchange:

[Screen Shot 2020-02-04 at 9 25 50 AM]

JanDash avatar Feb 04 '20 23:02 JanDash

This came up in #33 as well. However, it doesn't surface in gRPC implementations other than Java, so I might need some help debugging this.

The code is rather short; maybe take a look to see if we're missing a conn.Close(). I'll take a look as well.

ahmetb avatar Feb 04 '20 23:02 ahmetb

Welp, it seems like we do a bunch of os.Exit(..) calls, which means defer conn.Close() never gets a chance to run.

Should be a rather easy fix. Would you be able to compile a binary and test it if I give you a patch?
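
For reference, a minimal sketch of the pattern (not the probe's actual code): deferred calls are skipped when os.Exit runs, so the exit has to happen after the function that owns the defer has returned.

package main

import (
	"fmt"
	"os"
)

// run owns the resource and its defer; returning an exit code instead of
// calling os.Exit here guarantees the deferred cleanup executes.
func run() int {
	defer fmt.Println("cleanup ran") // stands in for conn.Close()
	fmt.Println("doing the check")
	return 0
}

func main() {
	// os.Exit skips deferred calls, so it is only called here,
	// after run() has returned and its defers have run.
	os.Exit(run())
}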

ahmetb avatar Feb 04 '20 23:02 ahmetb

See #36; can you compile it and let me know if that helps?

ahmetb avatar Feb 05 '20 00:02 ahmetb

Hi Ahmet, thanks a lot for your quick response! I certainly can, but I need to head out in a minute. I will do it first thing in the morning (PST), is that OK?

JanDash avatar Feb 05 '20 01:02 JanDash

I just built the updated code and tested the binary, but unfortunately I am still getting the same errors.

JanDash avatar Feb 05 '20 20:02 JanDash

Can you please try this: change the defer conn.Close() to

defer func() {
	log.Println("closing connection")
	conn.Close()
	log.Println("closed connection")
}()

and recompile & run. We should see these messages clearly on the exit path of the probe.

ahmetb avatar Feb 05 '20 23:02 ahmetb

The messages are printed out but the issue remains. I passed this on to one of my colleagues who knows Go, to see if he can spot anything.

JanDash avatar Feb 06 '20 01:02 JanDash

That’s indeed peculiar. Is the Wireshark dump the same, i.e. are you still seeing the RST?

ahmetb avatar Feb 06 '20 02:02 ahmetb

My coworker made a few changes as an experiment and it still sends an RST, same as before. I am as confused as you are; typically sockets are closed gracefully by the OS when a process dies. Unfortunately I don't have any more cycles to experiment further, but please feel free to update this if you make a breakthrough.

JanDash avatar Feb 06 '20 23:02 JanDash

My idea would be to add time.Sleep(time.Second*2) before os.Exit() to see if keeping the process alive a bit longer lets the connection close.

But since you applied the code block above and saw closed connection in the output, honestly I have no other answer.

One good litmus test would be to see if this happens with all Go gRPC clients calling Java gRPC servers. (I assume not, but it might help bisect where the problem is.) For that, you could write a small program that makes an RPC and exits.
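
Such a program might look roughly like this (a sketch only; the address, timeout, and dial options are assumptions, and it uses the grpc-go API current at the time):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Dial the Java gRPC server under test (address is a placeholder).
	conn, err := grpc.DialContext(ctx, "localhost:50051", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()

	// Make a single health-check RPC, then exit, and watch whether the
	// server logs a "connection reset by peer".
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check failed: %v", err)
	}
	log.Printf("status: %v", resp.GetStatus())
}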

ahmetb avatar Feb 07 '20 03:02 ahmetb

I went ahead and merged #36 to master, so you can build from there.

NB: Issue is still not fixed by #36.

ahmetb avatar Feb 07 '20 03:02 ahmetb

Regarding #36, can we send a SIGTERM instead of calling os.Exit(retCode)? Would that solve the issue?

thejasn avatar Mar 12 '20 04:03 thejasn

I don’t see why it would help.

ahmetb avatar Mar 12 '20 06:03 ahmetb

Is this being worked on for the next release? =)

jsabalos avatar Mar 17 '20 19:03 jsabalos

Is this being worked on for the next release? =)

No. We can’t repro the issue in other languages. Furthermore we don’t know what’s causing it. If you have ideas or time to debug, please contribute.

ahmetb avatar Mar 18 '20 00:03 ahmetb

I think I'm seeing something (perhaps) similar to this with a Go service?

With this health service:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	"google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/reflection"
)

func main() {

	lis, err := net.Listen("tcp", "localhost:50052")
	if err != nil {
		log.Fatal("failed to listen")
	}

	h := health.NewServer()

	// Docs say that an empty service name is used to indicate "overall server health",
	// which is the only thing we care about. We only have one service. Refer to
	// https://github.com/grpc/grpc/blob/master/doc/health-checking.md
	h.SetServingStatus("", grpc_health_v1.HealthCheckResponse_SERVING)

	s := grpc.NewServer()
	grpc_health_v1.RegisterHealthServer(s, h)

	// we're happy to do this because we can use grpcurl without having to provide a .proto file
	reflection.Register(s)

	log.Print("Starting healthcheck")

	if err := s.Serve(lis); err != nil {
		log.Fatal("failed to serve healthcheck")
	}

}

Running:

GRPC_GO_LOG_VERBOSITY_LEVEL=2 GRPC_GO_LOG_SEVERITY_LEVEL=info go run main.go

And triggering the health_probe in another terminal:

grpc-health-probe -addr localhost:50052

I'm seeing this WARNING in the service logs:

WARNING: 2020/04/21 14:02:57 transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:50052->127.0.0.1:53958: read: connection reset by peer
INFO: 2020/04/21 14:02:57 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

I wasn't sure if this was intended behaviour so I changed:

defer conn.Close()

in grpc-health-probe/main.go to:

defer func() {
	time.Sleep(time.Second)
	conn.Close()
}()

And when running that against the service above I only see:

INFO: 2020/04/21 14:03:20 transport: loopyWriter.run returning. connection error: desc = "transport is closing"

In the logs.

I did a bit more munging and found that inserting:

time.Sleep(time.Millisecond)

between the Check returning and conn.Close() being called (I put it on line 234) seems to prevent this warning as well.
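
For anyone wanting to reproduce this locally, a self-contained sketch of that workaround against the server above might look like this (the address, timeout, and sleep duration are assumptions; this is not the probe's actual code):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "localhost:50052", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("check failed: %v", err)
	}
	log.Printf("status: %v", resp.GetStatus())

	// Workaround described above: pause briefly so the client transport can
	// finish its HTTP/2 shutdown before the connection is torn down.
	time.Sleep(time.Millisecond)
	conn.Close()
}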

danieldooley avatar Apr 21 '20 02:04 danieldooley

Any advice on how we should proceed here? I'm not in favor of adding notable sleeps here and there, since defer conn.Close() is supposed to actually close the connection.

ahmetb avatar Oct 28 '20 20:10 ahmetb

This is a known issue already reported: https://github.com/grpc-ecosystem/grpc-health-probe/issues/34

We spent some time researching it to no avail as you can see in the thread.

I have long been thinking about writing a small C client (to be closer to the metal and actually have control) to replace the Go implementation, but I never seem to find the time.


JanDash avatar Oct 28 '20 21:10 JanDash

I can testify to seeing the same error in a Go service like https://github.com/grpc-ecosystem/grpc-health-probe/issues/34#issuecomment-616921552.

owenthereal avatar May 30 '21 19:05 owenthereal

I suspect this issue is caused by https://github.com/grpc/grpc-go/issues/460

GeertJohan avatar Jan 26 '22 08:01 GeertJohan

We just hit this problem while running the prober on our GKE services. We were observing a slow leak of bpf_jit memory allocations on the node, counted with:

jq -n "`sudo cat /proc/vmallocinfo | grep bpf_jit | awk '{ print $2 }' | jq -s 'add'` / `sudo cat /proc/sys/net/core/bpf_jit_limit`"

Within weeks it exceeds the bpf_jit_limit, and the node ends up in a state where starting a new container fails with:

OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: setns_init_linux.go:88: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: errno 524: unknown

The node needs a restart to recover.

Workarounds

We have observed this on 3 independent services. Using the GKE built-in gRPC prober solves the issue. Adding a sleep of 100ms (1ms is not enough in our case), as suggested in https://github.com/grpc-ecosystem/grpc-health-probe/issues/34#issuecomment-616921552, also works.

Proper fix

Is there any idea for a proper fix that doesn't involve a sleep? (Or maybe https://github.com/grpc/grpc-go/issues/460 needs more prioritization, if that really is the cause.)

As a side note: do you think it's worth removing the issue's "harmless" tag? As our case shows, in certain circumstances this issue can cause real trouble.

xorver avatar Aug 24 '22 10:08 xorver