Pipeline Invocations Sporadically Time Out When a Dataflow Engine Pod Gets Rolled
I see quite a bit of work being done in 2.9 and 2.10 to improve Pipeline stability, and my team is looking to use Pipeline in a production environment, so we are conducting some tests. In our EKS clusters, we have a Karpenter node group dedicated to Seldon. To keep nodes fresh, Karpenter drains them every 6 days, so I wanted to simulate what happens (on 2.10 with an MSK cluster) when a node gets drained. To do this, I:
- Started a Locust test with 2 workers to invoke the Pipeline.
- While the test was running, rolled one of the seldon-dataflow-engine pods and observed the number of failed invocations.
I saw the Pipeline object's `PipelineReady` condition switch to `False`. After about 1 minute, the Pipeline object recovered. During that time, some requests succeeded, but most timed out (request timeout set to 5 s).
Since all our pods have to be rolled every 6 days, this introduces at least a minute of model downtime every 6 days. Are there any configuration changes I can make to Seldon to reduce that 1-minute window?
@charleschangdp In order to help you with the issue, it would be helpful to understand more about the setup of the test and the way Seldon Core 2 is configured:
- What is the status of the `ModelsReady` condition on the Pipeline object when `PipelineReady` switches to `False`?
- How many partitions does each Kafka topic associated with the pipeline have? This is set via the `kafka.topics.numPartitions` helm chart value, and can be inspected for a running cluster by looking at the value of the `SELDON_KAFKA_PARTITIONS_DEFAULT` environment variable on any of the `dataflow-engine` pods (see the sketch after this list).
- What is the value set for `SeldonConfig.config.scalingConfig.pipelines.maxShardCountMultiplier`?
- How many replicas of `dataflow-engine`, `pipeline-gateway` and `model-gateway` exist within the cluster?
- How many pipelines are loaded into the cluster running the load test?
- Do the inference servers that have loaded models corresponding to the tested pipeline:
  a. have multiple replicas, with the models also having multiple replicas?
  b. run on the same node that was drained by Karpenter?
  c. have Pod Disruption Budgets (PDBs) set so that at least one replica remains up at all times (implicitly, this also means that at least one model replica remains up at all times)?
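As a side note, if it's more convenient than going through `kubectl describe` output, here is a minimal sketch for reading that environment variable off the running pods with the official Kubernetes Python client. The `seldon-mesh` namespace and the `app.kubernetes.io/name=seldon-dataflow-engine` label selector are assumptions on my part; adjust them to match your installation.

```python
# Sketch: print SELDON_KAFKA_PARTITIONS_DEFAULT for every dataflow-engine pod.
# The namespace and label selector below are assumptions -- adjust to your install.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="seldon-mesh",
    label_selector="app.kubernetes.io/name=seldon-dataflow-engine",
)
for pod in pods.items:
    for container in pod.spec.containers:
        for env in container.env or []:
            if env.name == "SELDON_KAFKA_PARTITIONS_DEFAULT":
                print(pod.metadata.name, env.value)
```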
If you have the number of partitions of the Kafka topics set to 1, or if the models that are part of the pipeline have a single replica and the server they are loaded on is on the node being drained by Karpenter, then the downtime is expected (and avoidable by increasing the number of partitions, or the number of replicas of the model and server, and by configuring PDBs). If not, we can look further into replicating the behaviour you're experiencing.
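On the PDB point, here is a minimal sketch of what that could look like, expressed with the Kubernetes Python client. The namespace, PDB name and the `seldon-server-name=mlserver` label selector are assumptions; match whatever labels your inference server pods actually carry.

```python
# Sketch: a PodDisruptionBudget that keeps at least one inference server replica
# up during node drains. Namespace, name and selector are assumptions.
from kubernetes import client, config

config.load_kube_config()

pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="mlserver-pdb", namespace="seldon-mesh"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available=1,  # a drain can never evict both server replicas at once
        selector=client.V1LabelSelector(
            match_labels={"seldon-server-name": "mlserver"}
        ),
    ),
)

client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="seldon-mesh", body=pdb
)
```

With `min_available: 1`, the drain has to wait for the rescheduled replica to become ready before evicting the second one, so at least one model replica stays reachable throughout.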
Thank you for reaching out. We're glad you're looking into using Pipelines in a production environment, and we're here to help. If you already have a contractual relationship with Seldon, we can provide more support through non-public channels as well. However, for this type of issue (where multiple users of Seldon Core may be impacted or interested in the outcome) I think it's very important to also surface such discussions in a public forum (i.e. here).
- `ModelsReady` remains `True`
- 4 partitions
- 4
- 4 replicas for each
- 1
- Inference servers are set to 2 replicas, which are scheduled on different nodes, and they have a PDB.
We already have a contractual relationship with Seldon. But happy to discuss this publicly.
I also want to be more precise about my earlier statement that "most requests timed out (request timeout set to 5 s)". I set the Locust test to run with 4 workers and a 5-second timeout, which gives about 32 RPS. When a dataflow-engine pod is rolled, we see about 0.5 to 1 failed RPS, while the successful RPS is somewhere between 5 and 10. The long timeout is preventing other requests from being sent; if the timeout were reduced, there would be both more failed and more successful requests.
Thank you for the update, we'll look into this.
Some of the degradation in throughput will be caused by one dataflow-engine replica going away, but not all of it: once that replica is rolled, the remaining replicas take part in a cooperative rebalancing protocol controlled by Kafka Streams, essentially redistributing the Kafka topic partitions amongst themselves. While this protocol is not a "stop-the-world" one, it likely does have an impact on throughput.
That being said, the drop in throughput shouldn't be as large as seen here (although some of it is also an artifact of the load test being run with a fixed number of workers).
It would actually be interesting if you had a metric for how many requests are timing out. If Locust doesn't expose a timeout metric directly, the median latency or some other percentiles would help. From the 95th percentile graph, we know at least 5% of requests are timing out.
A load test that would show the real drop in throughput (and the real timeout percentage) would be an open workload model that tries to maintain ~30 RPS of inference requests, because that decouples the drop in throughput caused by workers that time out (and pull back when the system gets slow) from the actual drop caused by the replica going away and the rebalancing happening between the other replicas.
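To make the open-workload idea concrete, below is a minimal sketch in plain Python (asyncio + aiohttp) that fires requests on a fixed arrival schedule regardless of how long earlier requests take. The URL, `Seldon-Model` header and payload are placeholders to adapt to your pipeline and ingress setup.

```python
# Sketch: open-workload generator sending ~TARGET_RPS requests per second on a
# fixed schedule, so slow or timed-out responses don't throttle the offered load.
# URL, header and payload are placeholders -- adjust to your pipeline/ingress.
import asyncio
import time

import aiohttp

TARGET_RPS = 30
DURATION_S = 120
TIMEOUT_S = 5
URL = "http://<seldon-mesh-ip>/v2/models/<pipeline-name>/infer"
HEADERS = {"Seldon-Model": "<pipeline-name>.pipeline"}
PAYLOAD = {"inputs": [{"name": "INPUT", "shape": [1], "datatype": "FP32", "data": [1.0]}]}

results = {"ok": 0, "timeout": 0, "error": 0}


async def fire(session: aiohttp.ClientSession) -> None:
    try:
        async with session.post(
            URL, json=PAYLOAD, headers=HEADERS,
            timeout=aiohttp.ClientTimeout(total=TIMEOUT_S),
        ) as resp:
            await resp.read()
            results["ok" if resp.status == 200 else "error"] += 1
    except asyncio.TimeoutError:
        results["timeout"] += 1
    except aiohttp.ClientError:
        results["error"] += 1


async def main() -> None:
    # no connection limit, so in-flight requests don't queue behind each other
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=0)) as session:
        start = time.monotonic()
        tasks = []
        for i in range(TARGET_RPS * DURATION_S):
            # schedule each request on a fixed arrival grid (open workload)
            await asyncio.sleep(max(0.0, start + i / TARGET_RPS - time.monotonic()))
            tasks.append(asyncio.create_task(fire(session)))
        await asyncio.gather(*tasks)
    print(results)


asyncio.run(main())
```

Because arrivals are scheduled on a fixed grid rather than after the previous response completes, the measured failure percentage reflects the actual impairment rather than the workers backing off.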
I can run such a test on our end.
I agree that a test that targets a specific RPS would provide some more insights. For the Locust test, since we know the Pipeline is impaired for about a minute, I ran a test for about 45 seconds.
That's helpful, thank you for the additional test and details.
The response time statistics show that less than 10% of requests time out, but at least 5% do (because the 90th percentile is well below the timeout and the 95th is at the timeout). Of course, the real number would be higher if the workers didn't slow down their requests when the system is slow.
However, the latency profile data above also helps us better simulate your case in a load test at constant RPS (we use k6 for internal load testing). This is because we can use a synthetic model that mimics the latency profile of the lower percentiles to try to replicate the behaviour.
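For reference, assuming the server in question is MLServer, the synthetic model could be as simple as a custom runtime that sleeps for a duration sampled from a small latency distribution before echoing a response. This is only a sketch; the latency/weight pairs below are placeholders to be replaced with the percentiles from your Locust report.

```python
# Sketch: a synthetic MLServer model that mimics an observed latency profile by
# sleeping before responding. LATENCY_PROFILE values are placeholders.
import asyncio
import random

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput

# (latency_seconds, weight) pairs approximating the lower percentiles
LATENCY_PROFILE = [(0.05, 0.5), (0.12, 0.4), (0.40, 0.1)]


class SyntheticLatencyModel(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        latencies, weights = zip(*LATENCY_PROFILE)
        await asyncio.sleep(random.choices(latencies, weights=weights, k=1)[0])
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(name="echo", shape=[1], datatype="BYTES", data=["ok"])
            ],
        )
```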
@lc525 Following up to see if you have made any progress on this topic.
We've implemented a Karpenter node group with on-demand instances dedicated to the dataflow-engine pods. Every X days, Karpenter still needs to roll these nodes per our infosec policy, so those pods get drained and moved to a new node. We have a Pipeline that receives live traffic with a fallback in place, so I have some client-side metrics for the percentage of calls that succeed. Some observations:
- The pipeline is usually impaired for anywhere between 5 and 20 minutes.
- During each impairment period, we are seeing anywhere from 15% to 25% of the traffic failing.