Best practices for deploying a Faust agent using Kubernetes
Checklist
- [x] I have included information about relevant versions
- [ ] I have verified that the issue persists when using the master branch of Faust.
Steps to reproduce
I am trying to deploy a Faust agent to a production environment using 2 pods. The agent consumes from a topic that has 6 partitions. After the deploy, the agent runs until it receives a SIGTERM (15), at which point it shuts down and stops consuming messages.
I am wondering if there are any best practices around deploys using Kubernetes.
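For context, here is a minimal sketch of the kind of app and agent described above. The app id, topic name, and broker address are placeholders, not the actual configuration:

```python
import faust

# Placeholder names; the real app id, broker, and topic differ.
app = faust.App('example-app', broker='kafka://kafka:9092')

# Topic with 6 partitions, consumed by the agent below across 2 worker pods.
source_topic = app.topic('example-topic', value_type=bytes)

@app.agent(source_topic)
async def process(stream):
    async for message in stream:
        ...  # processing logic goes here

if __name__ == '__main__':
    app.main()
```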
Expected behavior
The agent gracefully handles the SIGTERM.
Actual behavior
The app shuts down and stops consuming messages.
Versions
- Python version: 3.7
- Faust version: 1.10.4
@vishal-kvn I use a k8s Deployment to run the Faust workers. I have configured the Faust app to auto-discover the agents, and the workers run indefinitely. This setup works fine for me.
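For reference, a minimal sketch of that kind of configuration; the app id, package name, and broker address are assumptions, not the actual chart values:

```python
import faust

app = faust.App(
    'kafka-aggregator',            # assumed app id
    broker='kafka://kafka:9092',   # assumed broker address
    autodiscover=True,             # scan the package for @app.agent definitions
    origin='kafkaaggregator',      # assumed package name to scan
)
```

Each pod in the Deployment then runs the worker with something like `faust -A kafkaaggregator.app worker -l info`.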
@afausti Thanks for the reply. I will try it out.
@afausti Setting autodiscover=True did not fix the above issue. Also, I noticed that you set the replicaCount to 1 (https://github.com/lsst-sqre/charts/blob/master/charts/kafka-aggregator/values.yaml#L3) for your worker. Have you deployed with a replicaCount greater than 1? For my use case I have a replicaCount of 3, but I noticed that only 1 worker (pod) is consuming messages.
Please let me know if you have come across this behavior.
A couple of questions:
- How many partitions do you have on your topic? You need at least one partition per worker.
- Have you run `kubectl describe` on the pod after it is killed to get the status/event information? That should tell you why K8s is killing the pod.
- Do you have a readinessProbe and/or livenessProbe configured? (See the probe sketch after this list.)
- Are you allocating enough memory for the pods? OOMKilled is a very common reason for pods to get killed.

Kubernetes will tell you what it doesn't like; you just need to look hard for it.
Hope this helps
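On the probe question: Faust's built-in web server (port 6066 by default) can serve a simple endpoint for the kubelet to hit. A rough sketch using a class-based web view; the path is made up, and a production check would inspect consumer/partition state rather than always returning ok:

```python
import faust
from faust.web import Request, Response, View

app = faust.App('example-app', broker='kafka://kafka:9092')

@app.page('/healthz/')
class Health(View):
    """Endpoint for Kubernetes readiness/liveness probes."""

    async def get(self, request: Request) -> Response:
        # Always reports healthy; a real check would verify that the
        # consumer is assigned partitions and is making progress.
        return self.json({'status': 'ok'})
```

The readinessProbe/livenessProbe would then point at `/healthz/` on the worker's web port.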
@bobh66 Thanks for the reply.
- "How many partitions do you have on your topic? You need at least one partition per worker." I have one topic that has 6 partitions.
- "Have you run `kubectl describe` on the pod after it is killed to get the status/event information? That should tell you why K8s is killing the pod." I will be looking into this and will share more info.
- "Do you have a readinessProbe and/or livenessProbe configured?" Yes. The pods pass the livenessProbe check.
- "Are you allocating enough memory for the pods? OOMKilled is a very common reason for pods to get killed." I haven't seen an OOMKilled error in the logs, and I have provisioned sufficient memory for the deploy.
- "Kubernetes will tell you what it doesn't like; you just need to look hard for it." Ack! I will take a closer look at the logs to find the root cause.
@afausti I see you're using the memory storage for Tables. Do you think you'd need to use a StatefulSet instead of a Deployment if you switched to rocksdb?
@taybin Have you tried implementing a StatefulSet for Faust when using RocksDB?
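For what it's worth, here is a rough sketch of what the RocksDB switch looks like on the Faust side; the table name and data directory are made up. The store keeps local state under datadir, which is why pairing it with a StatefulSet (one PersistentVolumeClaim per pod) is the usual suggestion:

```python
import faust

app = faust.App(
    'kafka-aggregator',
    broker='kafka://kafka:9092',
    store='rocksdb://',          # switch tables from memory:// to RocksDB
    datadir='/var/lib/faust',    # assumed mount point for a per-pod persistent volume
)

# Hypothetical table; its local RocksDB state survives pod restarts
# only if /var/lib/faust is backed by persistent storage.
counts = app.Table('counts', default=int)
```

(This also assumes the worker image has the RocksDB extra installed, e.g. `faust[rocksdb]`.)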
@vishal-kvn My Faust app is also getting a SIGTERM (15), though I'm running via docker-compose, not k8s. I'm wondering if this ever went anywhere for you?