scio icon indicating copy to clipboard operation
scio copied to clipboard

Improve rate-limiting APIs for streaming pipelines

Open clairemcginty opened this issue 2 years ago • 3 comments

We have RateLimiterDoFn but it's a bit clunky for the average user to drop down to the java DoFn API directly. It could be nice to have an API like

import com.spotify.scio.throttling._

data.map(x => ..., maxElementsPerSecond = 1000.0)

or data.throttled(maxElementsPerSecond = 1000.0).map(...)

Since we have access to ScioContext within def map(f: T => U), we would know the value of the --numWorkers/--maxNumWorkers arguments, and could set a per-worker rate appropriately. That way from the user's perspective, they're setting a "global" rate limit and don't have to worry about making these calculations themselves. (Maybe it could be a precondition of using these functions that either --maxNumWorkers is set OR autoscaling is off?)

clairemcginty avatar Dec 08 '21 22:12 clairemcginty

I notice you're talking about improving the user experience with RateLimiterDoFn. I tried it a few months ago. I had an issue to work around (see https://github.com/spotify/scio/issues/3988) but I was able to get it kind of working. I actually ran into some problems after the job had been running for a few hours. It wasn't working well with an extremely low rate limit when I was using it with Pub/Sub and Dataflow on GCP. I ended up getting duplicate messages from Pub/Sub and errors in the logs about stuck threads.

If you like, I can try to reproduce the issue I encountered and create an issue here about it.

mattwelke avatar Dec 16 '21 19:12 mattwelke

@mattwelke Error reproduction always appreciated!

kellen avatar Apr 25 '22 17:04 kellen

~~I'm going to close this.~~ Based on my experience using Dataflow/Beam, it feels like doing rate limiting inside the job doesn't really make sense. The only way it could be implemented would be to store data in state and ack Pub/Sub messages while it does so, and there's no way to back up state, so you'd just end up losing the data. We'll be looking into other solutions for rate limiting.

Edit: I didn't create the issue, so I can't close it. Just withdrawing my name from the feature request.

mattwelke avatar Jun 27 '22 23:06 mattwelke