consul-k8s
Graceful shutdown with injected envoy-sidecar
The Envoy proxy sidecar receives SIGTERM at the exact same moment as my main container. Unlike my main container (which shuts down in roughly 15-30 seconds), the Envoy sidecar shuts down almost immediately (0.5-3 s). This means my main container loses its upstream connections and cannot shut down gracefully; even a rolling update means lost requests/data.
There should be some kind of mechanism so the upstream listeners exit last. My main container is Consul Connect enabled and communicates with upstreams through Connect, but the service itself is not accessed through Connect; it is accessed through Consul DNS instead.
Is there a workaround/hack (some kind of preStop sleep), or do I have to get rid of Consul Connect?
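For reference, the kind of preStop sleep hack being asked about looks roughly like this on the application container (a sketch; the container name and sleep duration are placeholders, and note this alone does not stop the injected sidecar from exiting early):

```yaml
# Hypothetical pod spec fragment: delay the app container's shutdown so
# in-flight work can drain after the pod is deregistered. This does NOT
# delay the injected envoy-sidecar, which still receives SIGTERM
# immediately -- that is exactly the problem described above.
spec:
  containers:
    - name: my-app            # placeholder name
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 30"]
```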
Similar issues:
- https://github.com/istio/istio/issues/7136
- https://github.com/linkerd/linkerd2/issues/3747
Have you tried adding terminationGracePeriodSeconds to your pod? Though you might then end up with https://github.com/hashicorp/consul-k8s/issues/540 instead.
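For context, terminationGracePeriodSeconds is set at the pod spec level (a sketch; the value is a placeholder):

```yaml
# Pod spec fragment: extend the window between SIGTERM and SIGKILL.
# This lengthens the deadline, but it does not change *when* the
# envoy-sidecar receives SIGTERM, which is the issue reported here.
spec:
  terminationGracePeriodSeconds: 120   # Kubernetes default is 30
```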
Thanks for the issue @svobol13! This is part of a larger problem in Kubernetes itself: there are no lifecycle hooks that would let us control container shutdown order. We'd need to investigate how to work around this until there's a proper solution in Kubernetes.
Posting a reference to a comment that I made after going down this rabbit hole, to hopefully save others some time: https://github.com/istio/istio/pull/18333#issuecomment-895522403
@pedrohdz I did some tests and terminationGracePeriodSeconds is not enough. As you can see in this image, the grace period starts after the SIGTERM signal, and at that point it is already too late.
The problem really comes from Envoy, which should not stop on SIGTERM before completing the last received requests.
My workaround was to add a preStop: sleep 30s hook on my container, set terminationGracePeriodSeconds, and override the Envoy Docker image so that it ignores SIGTERM.
This works fine because Consul removes the service from its catalog when SIGTERM is triggered, so I have 30 seconds to finish the work.
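The "ignore SIGTERM" part of this workaround can be demonstrated with a small shell experiment (illustrative only; in practice the overridden Envoy image would install the trap in its entrypoint before exec'ing Envoy, relying on the SIGKILL at the end of the grace period to finally stop it):

```shell
# Demo of the signal trick behind a patched Envoy image: a process that
# traps (ignores) SIGTERM keeps running when a kubelet-style TERM
# arrives, and is only stopped later (by SIGKILL in Kubernetes).
sh -c 'trap "" TERM; sleep 3' &   # stand-in for the wrapped Envoy process
child=$!
sleep 1                            # let the child install its trap
kill -TERM "$child"                # simulate the kubelet sending SIGTERM
sleep 1
survived=no
if kill -0 "$child" 2>/dev/null; then
  survived=yes                     # still running despite SIGTERM
fi
echo "survived=$survived"
wait "$child" 2>/dev/null
```

Ignored signal dispositions survive exec, so a real entrypoint can do `trap '' TERM` and then exec the proxy binary.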
Unfortunately, ingress/terminating gateways suffer from the same problem, and we cannot use the same workarounds... :confused: (preStop is already set)
I think we need to do something similar to ECS: https://www.consul.io/docs/ecs/architecture#task-shutdown where we deregister the service immediately but keep Envoy running until the application container shuts down.
As a user I really need to be able to control the shutdown. In my case I have CLI applications that only use Envoy for outbound connections. Some of these take 1-2 minutes to gracefully stop their current work after receiving SIGTERM. During that period Envoy needs to stay up and available. What happens now is that we get errors because Envoy shuts down very quickly. Being able to add a simple preStop hook with a sleep to Envoy would make this easy for me to handle.
Any updates on this? We just need an annotation as suggested in https://github.com/hashicorp/consul-k8s/pull/911/files.
Here is how linkerd handles it https://linkerd.io/2.11/tasks/graceful-shutdown/.
I think this issue should be given really high priority, as it is currently impossible to deploy in Kubernetes without connection errors. The problem also arises whenever an HPA scales pods down.
I've resorted to running a custom-built binary with a patch containing the changes here: https://github.com/hashicorp/consul-k8s/pull/911. It is the only way to run Consul Connect in production at present without getting 5xx errors during deployments and scale-downs. The product team definitely needs to look at this issue and make it a priority, as this is a basic piece of being production-ready.
Hi @dschaaff, thanks for the feedback. We are monitoring this issue as well, aside from other items we have targeted for our next releases tied to Consul core that are more architecturally related. I can't say definitively when we will address this, but I do want to support a native solution in consul-k8s.
We are also facing this issue: our app has draining configured, but as soon as the pod receives SIGTERM, Envoy shuts down while the app is still draining. Graceful termination would be very important and helpful to us, especially since it's a high-traffic app and any connection issues get noticed and reported quickly. As @dschaaff rightly mentioned, this is important for releasing to production.
I continue to build a forked image of the control-plane binary for each release in order to add a preStop hook to the Envoy sidecar. It's quite disappointing that this feature hasn't been added. This issue has been open for over a year, and it remains a blocker to production use of Consul Connect.
Hi @dschaaff and @narendrapatel, thanks for the feedback. I don't disagree that it is important to address and a blocker to getting to production. Right now we are at a point of competing priorities due to large architectural changes within Consul that we are actively working on.
Hi @dschaaff, if possible, can you please share how you are building the image? Here is what I tried:
- Forked the repo and checked out release(v0.46.1 in my case)
- Added the annotation changes, ref: https://github.com/narendrapatel/consul-k8s/pull/1/files
- Finally, built the image with `make control-plane-dev-docker DEV_IMAGE=consul-k8s-control-plane:0.46.0`
I tested it in my local setup and can confirm it works as expected, but I'm not sure about the build process.
I use this Dockerfile to build:

```dockerfile
FROM public.ecr.aws/docker/library/golang:1.18.4-alpine3.15 as build
ARG TARGETOS
ARG TARGETARCH
COPY . /go
RUN cd /go/control-plane && \
    set -x; go build -o pkg/bin/consul-k8s-control-plane

# final image
# we are simply copying our custom built binary over the standard binary in the image
FROM hashicorp/consul-k8s-control-plane:0.46.1
ARG TARGETOS
ARG TARGETARCH
COPY --from=build /go/control-plane/pkg/bin/ /bin
```
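To use a Dockerfile like this, the build-and-push flow is roughly the following (a sketch; the registry, image name, and tag suffix are placeholders, and the docker commands are left commented so nothing here depends on a local daemon):

```shell
# Hypothetical flow for publishing a patched control-plane image.
VERSION=0.46.1                 # match the base image tag in the Dockerfile
IMAGE="registry.example.com/consul-k8s-control-plane:${VERSION}-prestop"
echo "building ${IMAGE}"
# docker build -t "${IMAGE}" . # run from the consul-k8s checkout root
# docker push "${IMAGE}"
```

The Helm values would then point `global.imageK8S` at the pushed image.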
Does anyone use Consul Mesh/Connect in production? I can't understand how it can be used as-is (without a patch like this) while avoiding errors in applications that need time to finish their jobs. Maybe there is something new/fixed in 1.13.1?
@dschaaff Thank you for the Dockerfile! I use it with a combination of custom-image and "dynamic entrypoint" (https://github.com/hashicorp/consul-k8s/issues/1397#issuecomment-1259492742)
The big 1.0 rewrite has been out for a while now. Can anyone from HashiCorp comment on the timeline for fixing this issue? Due to the delay on this and the other bugs we are facing, we are considering dropping the Consul service mesh.
@dschaaff We're taking this issue very seriously and have a solid idea of potential fixes to alleviate the problems you're having. The timeline is a little gray, but with medium-to-strong probability you will see a fix in the 2023 calendar year.
Any news on this issue?
Hi @nrichu-hcp is there any update? We have multiple teams looking to go live in the next quarter with Consul Service Mesh, which we have spent 2 years arguing for vs Istio / ASM. This issue could easily force us to have to abandon Consul and migrate everyone to Istio/ASM. If we could just have a quarter date, that would be great.
We abandoned the Consul mesh after 2 years in production. We had to run a forked build of the controller that entire time just to be able to add the preStop sleep. In the end we switched to Linkerd.
Hi, @nrichu-hcp it looks like this PR fixes the issue, any idea when it will be merged and what release it will be in?