
Autoscaling behavior is not very effective out of the box for applications with slower startup times

Open · Rooknj opened this issue · 0 comments

Summary

The goal for our service is to smoothly handle incoming event traffic from a high-traffic Kafka topic (>8k TPS). One other thing to note is that our service, which uses Numaflow, has a relatively slow startup time (~1-2 minutes) and runs on the JVM, which needs some warm-up to reach peak performance.

What we noticed with Numaflow is that there is a disconnect between the consumer lag on the Kafka topic and the sink that ultimately processes all the messages, so the out-of-the-box autoscaling behavior does not work as expected.

First, it ramps up too slowly: the default is 2 additional instances at a time, and it only ramps up after an unacceptable pile-up of events. It should be easier to scale up more aggressively and proactively.

Second, even with a massive backlog of messages in Kafka, the Numaflow sink would oscillate between scaling up and down. My hypothesis is that this behavior is caused by the ISB buffer filling up extremely quickly, at which point no new messages can be written to it. The sink then scales up to deal with the full buffer, but that takes about a minute because of the application's slower startup time, and meanwhile the buffer stays completely full. Since the source can't insert any more events into the ISB, it is stuck while the sink is scaling up. Eventually the sink scales up and clears the buffer; once it is cleared, the pods scale down, the source starts to fill the buffer again, and the cycle repeats.
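For context, this is roughly the kind of tuning we have been experimenting with to soften the problem: more aggressive scale-up and slower scale-down on the sink vertex. A minimal sketch, assuming the per-vertex `scale` field names from the Numaflow autoscaling docs (names and defaults may vary by version; the pipeline, vertex, and image names here are hypothetical):

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: example-pipeline              # hypothetical name
spec:
  vertices:
    - name: jvm-sink                  # the slow-starting JVM sink (hypothetical name)
      scale:
        min: 2                        # keep warm replicas so a cold JVM start is not always on the critical path
        max: 30
        replicasPerScale: 6           # add more than the default 2 pods per scaling step
        lookbackSeconds: 120          # window used to evaluate pending/processing rates
        scaleUpCooldownSeconds: 60
        scaleDownCooldownSeconds: 300 # scale down slowly to damp the up/down oscillation
      sink:
        udsink:
          container:
            image: example/jvm-sink:latest   # hypothetical user-defined sink image
```

Even with these knobs, scaling is still driven by the ISB buffer rather than by the Kafka consumer lag, so it only softens the cycle described above.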

My expectation would be that the autoscaling behavior more closely mirrors this dynamic: if the Kafka topic has a large amount of consumer lag (unprocessed messages), the entire Numaflow pipeline scales up appropriately to smoothly process the backlog.
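In the meantime, raising the pipeline-level `limits` can keep the ISB from filling and back-pressuring the source quite so quickly while the sink is still starting up. Again this is only a partial mitigation, not the lag-driven scaling described above; the field names are taken from the pipeline spec docs and the values are illustrative:

```yaml
spec:
  limits:
    readBatchSize: 500       # messages read per batch
    bufferMaxLength: 100000  # larger ISB buffer so the source is not throttled immediately
    bufferUsageLimit: 85     # percentage of buffer usage at which writes are rejected
```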

Use Cases

When would you use this?


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

Rooknj · Jul 19 '24 18:07