
Best practices for deploying Faust agent using Kubernetes

Open vishal-kvn opened this issue 4 years ago • 8 comments

Checklist

  • [x] I have included information about relevant versions
  • [ ] I have verified that the issue persists when using the master branch of Faust.

Steps to reproduce

I am trying to deploy a Faust agent to a production environment using 2 pods. The agent consumes from a topic that has 6 partitions. After the deploy, the agent runs until it receives a SIGTERM (15), at which point it shuts down and stops consuming messages.

I am wondering if there are any best practices around deploys using Kubernetes.
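For reference, the Deployment spec looks roughly like this (app name, image, and grace period are placeholders; the knob relevant to shutdown is `terminationGracePeriodSeconds`, the window between SIGTERM and SIGKILL):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: faust-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: faust-worker
  template:
    metadata:
      labels:
        app: faust-worker
    spec:
      # Give the worker time to finish in-flight events and
      # commit offsets after Kubernetes sends SIGTERM.
      terminationGracePeriodSeconds: 60
      containers:
        - name: worker
          image: my-registry/my-faust-app:latest  # placeholder
          command: ["faust", "-A", "myapp", "worker", "-l", "info"]
```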

Expected behavior

Agent gracefully handles the SIGTERM.

Actual behavior

App shuts down and stops consuming messages.

Versions

  • Python version: 3.7
  • Faust version: 1.10.4

vishal-kvn avatar Jul 31 '20 23:07 vishal-kvn

@vishal-kvn I use a k8s deployment to run the Faust workers. I have configured the Faust app to auto-discover the agents, and the workers run indefinitely. This setup works fine for me.
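Roughly, the app configuration is just this (a minimal sketch; the app id, broker URL, and package name are placeholders):

```python
import faust

# autodiscover=True makes Faust scan the app's package for
# @app.agent definitions, so the worker picks them all up.
app = faust.App(
    "kafka-aggregator",                # placeholder app id
    broker="kafka://localhost:9092",   # placeholder broker
    autodiscover=True,
    origin="kafkaaggregator",          # package to scan for agents
)

if __name__ == "__main__":
    app.main()
```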

afausti avatar Jul 31 '20 23:07 afausti

@afausti Thanks for the reply. I will try it out.

vishal-kvn avatar Aug 01 '20 01:08 vishal-kvn

@afausti Setting autodiscover=True did not fix the above issue. Also, I noticed that you set the replicaCount to 1 (https://github.com/lsst-sqre/charts/blob/master/charts/kafka-aggregator/values.yaml#L3) for your worker. Have you deployed with a replicaCount greater than 1? For my use case I have a replicaCount of 3, but I noticed that only 1 worker (pod) is consuming messages. Please let me know if you have come across this behavior.
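To see how the 6 partitions are actually assigned across the 3 workers, I'm planning to inspect the Kafka consumer group (broker address and group id are placeholders; as far as I can tell, Faust uses the app id as the consumer group id):

```shell
# Show which consumer (worker) owns which partition, plus lag.
# If one client id owns all 6 partitions, only that pod consumes.
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-faust-app
```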

vishal-kvn avatar Aug 02 '20 11:08 vishal-kvn

A couple of questions:

  • How many partitions do you have on your topic? You need at minimum one partition per worker.

  • Have you run "kubectl describe" on the pod after it is killed to get the status/event information? That should tell you why K8S is killing the pod

  • Do you have a readinessProbe and/or livenessProbe configured?

  • Are you allocating enough memory for the pods? OOMKilled is a very common reason for pods to get killed.
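On the probe and memory points, a pod spec fragment might look like this (image, endpoint path, and thresholds are placeholders; this assumes you add a health view to the worker's web server, which listens on port 6066 by default):

```yaml
containers:
  - name: worker
    image: my-registry/my-faust-app:latest  # placeholder
    resources:
      requests:
        memory: "512Mi"
      limits:
        memory: "1Gi"   # pod is OOMKilled if it exceeds this
    livenessProbe:
      httpGet:
        path: /health   # placeholder: a view you add to the worker
        port: 6066      # Faust's default web port
      initialDelaySeconds: 30
      periodSeconds: 10
```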

Kubernetes will tell you what it doesn't like; you just need to look hard for it.

Hope this helps

bobh66 avatar Aug 02 '20 14:08 bobh66

@bobh66 Thanks for the reply.

  • "How many partitions do you have on your topic?" I have one topic that has 6 partitions.

  • "Have you run kubectl describe on the pod after it is killed to get the status/event information?" I will be looking into this and will share more info.

  • "Do you have a readinessProbe and/or livenessProbe configured?" Yes. The pods pass the livenessProbe check.

  • "Are you allocating enough memory for the pods?" I haven't seen an OOMKilled error in the logs, and I have provisioned sufficient memory for the deploy.

  • "Kubernetes will tell you what it doesn't like, you just need to look hard for it." Ack! I will take a closer look at the logs to find the root cause.
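For reference, the commands I'll use to dig into why the pod was killed (the pod name is a placeholder):

```shell
# Events and last-state (exit code, OOMKilled, etc.) for the pod
kubectl describe pod faust-worker-abc123

# Logs from the previous (killed) container instance
kubectl logs faust-worker-abc123 --previous
```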

vishal-kvn avatar Aug 03 '20 02:08 vishal-kvn

@afausti I see you're using the memory storage for Tables. Do you think you'd need to use a StatefulSet instead of a Deployment if you switched to rocksdb?
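With rocksdb each worker keeps its table state on local disk, so a StatefulSet would give each pod a stable identity and its own volume. A rough sketch (names, image, and sizes are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: faust-worker
spec:
  serviceName: faust-worker
  replicas: 2
  selector:
    matchLabels:
      app: faust-worker
  template:
    metadata:
      labels:
        app: faust-worker
    spec:
      containers:
        - name: worker
          image: my-registry/my-faust-app:latest  # placeholder
          volumeMounts:
            - name: tabledir
              # should match the Faust app's tabledir setting
              mountPath: /app/data
  volumeClaimTemplates:
    - metadata:
        name: tabledir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```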

taybin avatar Oct 28 '20 15:10 taybin

@taybin have you tried implementing a StatefulSet for Faust when using Rocksdb?

muaaaz avatar Nov 15 '21 10:11 muaaaz

@vishal-kvn My Faust app is also getting a SIGTERM (15), though I'm running via docker-compose, not k8s. I'm wondering if this ever went anywhere for you?
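In case it helps anyone else on docker-compose: the analogous knob there is `stop_grace_period`, the window between SIGTERM and SIGKILL on `docker-compose stop` (service name and image are placeholders):

```yaml
services:
  worker:
    image: my-faust-app:latest  # placeholder
    command: faust -A myapp worker -l info
    # Time between SIGTERM and SIGKILL on shutdown (default 10s)
    stop_grace_period: 60s
```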

burbma avatar Mar 06 '23 18:03 burbma