
Limit concurrency for caching

cisaacstern opened this issue 2 years ago • 5 comments

On the first production run of https://github.com/pangeo-forge/terraclimate-feedstock, Dataflow autoscaled the cluster to 1000 workers, in response to the slow throughput of caching ~882 inputs (totaling ~1.9 TB).

We should be able to limit concurrency for caching, given that the source file servers will generally be bandwidth-constrained. Dataflow provides a max_num_workers option to cap the size of the worker pool, but this issue is separate from that concern: concurrency should be limited only for the caching step, and then we should support larger scale-out after data is cached.

There must be a more formal discussion of this somewhere in the Beam docs, but for now the most direct discussion I've found is in the replies to https://stackoverflow.com/a/65634538, which suggest GroupByKey might be used to achieve this.
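For concreteness, here is a minimal sketch (not part of pangeo-forge-recipes) of that GroupByKey idea: assign each input URL to one of a small, fixed number of keys, group, and cache each group's elements serially, so at most that many fetches run concurrently. The `MAX_CONCURRENCY` value and the `cache_url` helper are hypothetical.

```python
import random

import apache_beam as beam

MAX_CONCURRENCY = 10  # hypothetical cap on simultaneous cache fetches


def cache_url(url):
    """Placeholder for the actual caching logic (copy `url` into the cache target)."""
    ...


def cache_group(keyed_urls):
    """Cache one group's URLs one after another, re-emitting each URL once cached."""
    _, urls = keyed_urls
    for url in urls:
        cache_url(url)
        yield url


with beam.Pipeline() as p:
    cached = (
        p
        | beam.Create(["https://example.com/file0.nc", "https://example.com/file1.nc"])
        # Assign each URL to one of MAX_CONCURRENCY keys...
        | "AssignLimitedKeys" >> beam.Map(lambda url: (random.randrange(MAX_CONCURRENCY), url))
        # ...so that after grouping, at most MAX_CONCURRENCY groups are processed in parallel.
        | "LimitConcurrency" >> beam.GroupByKey()
        | "CacheSerially" >> beam.FlatMap(cache_group)
    )
```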

I believe this will require pulling caching out of OpenURLWithFSSpec. Currently, if a cache argument is provided to OpenURLWithFSSpec, the input is cached and then immediately opened from the cache:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/bdb32f28d09fb3d2cb76296d988c5dcd64d9d80d/pangeo_forge_recipes/openers.py#L31-L32

In order to limit concurrency for the caching, but not for the opening, I believe caching will need to be its own transform, the output of which is then passed to OpenURLWithFSSpec, which does not do any caching.
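Something like the following hypothetical refactor is what I have in mind. `CacheURLs` does not exist; it's a sketch of a standalone caching transform whose output (the same index/URL pairs, emitted only once caching completes) would then feed an OpenURLWithFSSpec configured without a cache.

```python
import apache_beam as beam


class CacheURLs(beam.PTransform):
    """Hypothetical transform: cache each (index, url) pair, re-emitting the pair
    after caching completes, so downstream opening can scale out freely."""

    def __init__(self, cache):
        self.cache = cache

    def _cache(self, indexed_url):
        index, url = indexed_url
        # The exact caching call is an assumption; something along the lines of the
        # cache target's file-caching method in pangeo_forge_recipes.storage.
        self.cache.cache_file(url)
        return index, url

    def expand(self, pcoll):
        # A concurrency cap (e.g. the GroupByKey trick sketched above) would go here.
        return pcoll | "Cache" >> beam.Map(self._cache)


# Usage sketch: cache first (concurrency-limited), then open without a cache argument:
#   cached = indexed_urls | CacheURLs(cache=my_cache)
#   opened = cached | OpenURLWithFSSpec()
```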

cc @rabernat @alxmrs, xref #376

cisaacstern avatar Jul 22 '22 20:07 cisaacstern

Something that I just learned about in Beam is resource hints (https://beam.apache.org/documentation/runtime/resource-hints/). It sounds like this could pair really well with breaking out file caching.
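For example (illustrative only; the transform and the hint value are made up), hints attach to individual transforms, so a broken-out caching step could request resources independently of the rest of the pipeline:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    cached = (
        p
        | beam.Create(["https://example.com/file0.nc"])
        # Resource hints apply per-transform; here the (stand-in) caching step
        # asks the runner for workers with at least 16 GB of RAM.
        | "Cache" >> beam.Map(lambda url: url).with_resource_hints(min_ram="16GB")
    )
```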

alxmrs avatar Jul 22 '22 21:07 alxmrs

After taking a look at OpenURLWithFSSpec, I agree that caching should be a separate PTransform.

alxmrs avatar Jul 22 '22 21:07 alxmrs

Noting that https://github.com/pangeo-forge/paleo-pism-feedstock/issues/2 is blocked by this. Looks like we will not be able to deploy a production run of that feedstock until we have some way to limit concurrency during the caching stage. cc @jkingslake

cisaacstern avatar Sep 07 '22 18:09 cisaacstern

Here's an example that could help: for this kind of problem, we use this RateLimit transform: https://github.com/google/weather-tools/blob/main/weather_mv/loader_pipeline/util.py#L282
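Not their code verbatim, but the general pattern such a transform follows is roughly this sketch: shard elements across a fixed number of keys, group, and sleep between elements so overall throughput stays under a target rate. All names and numbers here are illustrative.

```python
import random
import time

import apache_beam as beam


class RateLimitSketch(beam.PTransform):
    """Throttle a PCollection to roughly `per_second` elements per second overall,
    spread across `num_shards` concurrently-processed shards."""

    def __init__(self, per_second=1.0, num_shards=10):
        self.num_shards = num_shards
        # Each shard handles ~1/num_shards of the traffic, so it waits this long per element.
        self.delay = num_shards / per_second

    def _emit_slowly(self, keyed_elements):
        _, elements = keyed_elements
        for element in elements:
            time.sleep(self.delay)
            yield element

    def expand(self, pcoll):
        num_shards = self.num_shards
        return (
            pcoll
            | "Shard" >> beam.Map(lambda x: (random.randrange(num_shards), x))
            | "Group" >> beam.GroupByKey()
            | "Throttle" >> beam.FlatMap(self._emit_slowly)
        )
```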

alxmrs avatar Sep 08 '22 02:09 alxmrs

Belated thanks for sharing this example, @alxmrs. 🙏

cisaacstern avatar Sep 24 '22 21:09 cisaacstern

Hi all! As always, thanks for all the work you do with these tools!

Any updates on this issue? As noted above, it is stopping progress on https://github.com/pangeo-forge/paleo-pism-feedstock/issues/2

jkingslake avatar Feb 27 '23 14:02 jkingslake

For those still following this, I am now working on a fix in #557

cisaacstern avatar Aug 05 '23 02:08 cisaacstern

This is fixed by #557. @jkingslake, please feel free to ping me on your recipe thread if you'd like to work together to revive it in light of this fix. And thanks for your patience!

cisaacstern avatar Aug 18 '23 23:08 cisaacstern