pangeo-forge-recipes
pangeo-forge-recipes copied to clipboard
Limit concurrency for caching
On the first production run of https://github.com/pangeo-forge/terraclimate-feedstock, Dataflow autoscaled the cluster to 1000 workers, in response to the slow throughput of caching ~882 inputs (totaling ~1.9 TB).
We should be able to limit concurrency for caching, given that the source file servers will generally be bandwidth-constrained. Dataflow provides a max_num_workers
option to cap the size of the worker pool, but this issue is separate from that concern: concurrency should be limited only for the caching step, and then we should support larger scale-out after data is cached.
There must be a more formal discussion of this somewhere in the Beam docs, but for now the most direct discussion I've found is in the replies to https://stackoverflow.com/a/65634538, which suggest GroupByKey
might be used to achieve this.
I believe this will require pulling caching out from OpenURLWithFSSpec
. Currently, if a cache
argument is provided to OpenURLWithFSSpec
, the input is cached and then immediately opened from the cache
https://github.com/pangeo-forge/pangeo-forge-recipes/blob/bdb32f28d09fb3d2cb76296d988c5dcd64d9d80d/pangeo_forge_recipes/openers.py#L31-L32
In order to limit concurrency for the caching, but not for the opening, I believe caching will need to be its own transform, the output of which is then passed to OpenURLWithFSSpec
, which does not do any caching.
cc @rabernat @alxmrs, xref #376
Something that I just learned about in Beam is resource hints (https://beam.apache.org/documentation/runtime/resource-hints/). It sounds like this could pair really well with breaking out file caching.
After TAL at OpenURLWifFSSpec
, I agree that Caching should be a separate PTransform.
Noting that https://github.com/pangeo-forge/paleo-pism-feedstock/issues/2 is blocked by this. Looks like we will not be able to deploy a production run of that feedstock until we have some way to limit concurrency during the caching stage. cc @jkingslake
Here's an example that could help: for this kind of problem we use this RateLimit
transform. https://github.com/google/weather-tools/blob/main/weather_mv/loader_pipeline/util.py#L282
Belated thanks for sharing this example, @alxmrs. 🙏
Hi all, As always, thanks for all the work you do with these tools!
Any updates on this issue? As noted above, it is stopping progress on https://github.com/pangeo-forge/paleo-pism-feedstock/issues/2
For those still following this, I am now working on a fix in #557
This is fixed by #557, @jkingslake please feel free to ping me on your recipe thread if you'd like to work together to revive it in light of this fix. And thanks for your patience!