consul-k8s
Graceful shutdown with injected envoy-sidecar
The Envoy proxy sidecar receives SIGTERM at the exact same moment as my main container. Unlike my main container (which shuts down in roughly 15-30 seconds), the Envoy sidecar shuts down almost immediately (0.5-3 s). This means my main container loses its upstream connections and cannot shut down gracefully; even a rolling update means lost requests/data.
There should be some kind of mechanism so the upstream listeners exit last. My main container is Consul Connect enabled and communicates with upstreams through Connect, but the service itself is not accessed through Connect; it is accessed through Consul DNS instead.
Is there a workaround/hack (some kind of preStop sleep), or do I have to get rid of Consul Connect?
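For reference, the kind of preStop sleep hack being asked about looks roughly like this on the application container (a sketch; the container name and sleep duration are placeholders, and note this alone does not stop the injected sidecar from exiting early):

```yaml
# Hypothetical pod spec fragment: delay the app container's shutdown so
# in-flight work can drain after the pod is deregistered. This does NOT
# delay the injected envoy-sidecar, which still receives SIGTERM
# immediately -- that is exactly the problem described above.
spec:
  containers:
    - name: my-app            # placeholder name
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 30"]
```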
Similar issues:
- https://github.com/istio/istio/issues/7136
- https://github.com/linkerd/linkerd2/issues/3747
Have you tried adding terminationGracePeriodSeconds to your pod? Though you might then end up with https://github.com/hashicorp/consul-k8s/issues/540 instead.
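For context, terminationGracePeriodSeconds is set at the pod spec level (a sketch; the value is a placeholder):

```yaml
# Pod spec fragment: extend the window between SIGTERM and SIGKILL.
# This lengthens the deadline, but it does not change *when* the
# envoy-sidecar receives SIGTERM, which is the issue reported here.
spec:
  terminationGracePeriodSeconds: 120   # Kubernetes default is 30
```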
Thanks for the issue @svobol13! This is part of a larger problem in Kubernetes itself: there are no lifecycle hooks that would let us control container shutdown order. We'd need to investigate how to work around this until there's a proper solution in Kubernetes.
Posting a reference to a comment that I made after going down this rabbit hole, to hopefully save others some time: https://github.com/istio/istio/pull/18333#issuecomment-895522403
@pedrohdz I did some tests and terminationGracePeriodSeconds is not enough. As you can see in this image, the grace period starts after the SIGTERM signal, and at that point it is already too late.
The problem really comes from Envoy, which should not stop on SIGTERM before completing the last received requests.
My workaround was to add a preStop: sleep 30s hook on my container, set terminationGracePeriodSeconds, and override the Envoy Docker image so that it ignores SIGTERM.
This works fine because Consul removes the service from its catalog when SIGTERM is triggered, so I have 30 seconds to finish the work.
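The "ignore SIGTERM" part of this workaround can be demonstrated with a small shell experiment (illustrative only; in practice the overridden Envoy image would install the trap in its entrypoint before exec'ing Envoy, relying on the SIGKILL at the end of the grace period to finally stop it):

```shell
# Demo of the signal trick behind a patched Envoy image: a process that
# traps (ignores) SIGTERM keeps running when a kubelet-style TERM
# arrives, and is only stopped later (by SIGKILL in Kubernetes).
sh -c 'trap "" TERM; sleep 3' &   # stand-in for the wrapped Envoy process
child=$!
sleep 1                            # let the child install its trap
kill -TERM "$child"                # simulate the kubelet sending SIGTERM
sleep 1
survived=no
if kill -0 "$child" 2>/dev/null; then
  survived=yes                     # still running despite SIGTERM
fi
echo "survived=$survived"
wait "$child" 2>/dev/null
```

Ignored signal dispositions survive exec, so a real entrypoint can do `trap '' TERM` and then exec the proxy binary.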
Unfortunately, ingress/terminating gateways suffer from the same problem, and we cannot use the same workarounds... :confused: (preStop is already set)
I think we need to do something similar to ECS: https://www.consul.io/docs/ecs/architecture#task-shutdown where we deregister the service immediately but keep Envoy running until the application container shuts down.
As a user I really need to be able to control the shutdown. In my case I have CLI applications that only use Envoy for outbound connections. Some of these take 1-2 minutes to gracefully stop their current work after receiving SIGTERM. During that period Envoy needs to stay up and available. What happens now is that we get errors because Envoy shuts down very quickly. Being able to add a simple preStop hook with a sleep to Envoy would make this easy for me to handle.
Any updates on this? We just need an annotation as suggested in https://github.com/hashicorp/consul-k8s/pull/911/files.
Here is how linkerd handles it https://linkerd.io/2.11/tasks/graceful-shutdown/.
I think this issue should be given really high priority, as it is currently impossible to deploy in Kubernetes without connection errors. The problem also arises whenever an HPA scales pods down.
I've resorted to running a custom-built binary with a patch containing the changes here: https://github.com/hashicorp/consul-k8s/pull/911. It is the only way to run Consul Connect in production at present without getting 5xx errors during deployments and scale-downs. The product team definitely needs to look at this issue and make it a priority, as this is a basic piece of being production-ready.
Hi @dschaaff, thanks for the feedback. We are monitoring this issue as well, aside from other items we have targeted for our next releases tied to Consul core that are more architecturally related. I can't say definitively when we will address this, but I do want to support a native solution in consul-k8s.
We are also facing this issue: our app has draining configured, but as soon as the pod receives SIGTERM, Envoy shuts down while the app is still draining. Graceful termination would be very important and helpful to us, especially since it's a high-traffic app and any connection issues get noticed and reported quickly. As @dschaaff rightly mentioned, this is important for releasing to production.
I continue to build a forked image of the control-plane binary for each release in order to add a preStop hook to the Envoy sidecar. It's quite disappointing that this feature hasn't been added. This issue has been open for over a year, and it remains a blocker to production use of Consul Connect.
Hi @dschaaff and @narendrapatel, thanks for the feedback. I don't disagree that it is important to address and a blocker to getting to production. Right now we are at a point of competing priorities due to large architectural changes within Consul that we are actively working on.
Hi @dschaaff, if possible, can you please share how you are building the image? Here is what I tried:
- Forked the repo and checked out release(v0.46.1 in my case)
- Added the annotation changes, ref: https://github.com/narendrapatel/consul-k8s/pull/1/files
- Finally, built the image with `make control-plane-dev-docker DEV_IMAGE=consul-k8s-control-plane:0.46.0`
I tested it in my local setup and can confirm it works as expected, but I'm not sure about the build process.
I use this Dockerfile to build:

```dockerfile
FROM public.ecr.aws/docker/library/golang:1.18.4-alpine3.15 as build
ARG TARGETOS
ARG TARGETARCH
COPY . /go
RUN cd /go/control-plane && \
    set -x; go build -o pkg/bin/consul-k8s-control-plane

# final image
# we are simply copying our custom built binary over the standard binary in the image
FROM hashicorp/consul-k8s-control-plane:0.46.1
ARG TARGETOS
ARG TARGETARCH
COPY --from=build /go/control-plane/pkg/bin/ /bin
```
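To use a Dockerfile like this, the build-and-push flow is roughly the following (a sketch; the registry, image name, and tag suffix are placeholders, and the docker commands are left commented so nothing here depends on a local daemon):

```shell
# Hypothetical flow for publishing a patched control-plane image.
VERSION=0.46.1                 # match the base image tag in the Dockerfile
IMAGE="registry.example.com/consul-k8s-control-plane:${VERSION}-prestop"
echo "building ${IMAGE}"
# docker build -t "${IMAGE}" . # run from the consul-k8s checkout root
# docker push "${IMAGE}"
```

The Helm values would then point `global.imageK8S` at the pushed image.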
Does anyone use Consul Mesh/Connect in production? I can't understand how it can be used as-is (without a patch like this) while avoiding errors in applications that need time to finish their jobs. Maybe there is something new/fixed in 1.13.1?
@dschaaff Thank you for the Dockerfile! I use it with a combination of custom-image and "dynamic entrypoint" (https://github.com/hashicorp/consul-k8s/issues/1397#issuecomment-1259492742)
The big 1.0 rewrite has been out for a while now. Can anyone from HashiCorp comment on the timeline for fixing this issue? Due to the delay on this and the other bugs we are facing, we are considering dropping the Consul service mesh.
@dschaaff We're taking this issue very seriously and have a solid idea of potential fixes to alleviate the problems you're having. The timeline is a little gray, but with medium-to-strong probability you will see a fix in the 2023 calendar year.
Any news on this issue?
Hi @nrichu-hcp is there any update? We have multiple teams looking to go live in the next quarter with Consul Service Mesh, which we have spent 2 years arguing for vs Istio / ASM. This issue could easily force us to have to abandon Consul and migrate everyone to Istio/ASM. If we could just have a quarter date, that would be great.
We abandoned the Consul mesh after 2 years in production. We had to run a forked build of the controller that entire time just to be able to add the preStop sleep. In the end we switched to Linkerd.
Hi, @nrichu-hcp it looks like this PR fixes the issue, any idea when it will be merged and what release it will be in?