pangeo-forge-recipes icon indicating copy to clipboard operation
pangeo-forge-recipes copied to clipboard

Maybe deprecate pattern_from_file_sequence

Open TomAugspurger opened this issue 4 years ago • 3 comments

See https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160#issuecomment-864075050 for justification, but the tl/dr is that for large recipes (many inputs), filepatterns_from_sequence can make large function objects.

So while filepatterns_from_sequence is very convenient, I think we're best off guiding users to writing a function that generates the filepattern for an input index dynamically, rather than looking it up from a list. Otherwise recipe reviewers will need to be on the lookout for uses of it when there are a large number of inputs.

TomAugspurger avatar Jun 24 '21 19:06 TomAugspurger

(Changed issue name per https://github.com/pangeo-forge/staged-recipes/issues/58#issuecomment-870157435.)

I do have a few open PRs which use pattern_from_file_sequence:

  • https://github.com/pangeo-forge/staged-recipes/pull/26
  • https://github.com/pangeo-forge/staged-recipes/pull/27
  • https://github.com/pangeo-forge/staged-recipes/pull/29

but I'm not especially attached to this method.

In fact, any convenience lost by removing this method is likely offset by the standardization benefit derived from making it so there is, in the words of PEP 20, "... one-- and preferably only one --obvious way to do it."

cisaacstern avatar Jun 29 '21 18:06 cisaacstern

Since my last comment on this issue I have found many, many situations where pattern_from_file_sequence has been remarkably helpful, usually for recipes with on the order of 3 - 20 input files. I'd therefore like to amend my previous comment

but I'm not especially attached to this

to say that I am in favor of keeping pattern_from_file_sequence.

Otherwise recipe reviewers will need to be on the lookout for uses of it when there are a large number of inputs.

Maybe we can build some automated checking for this, to minimize toil for maintainers.

cisaacstern avatar Sep 08 '21 21:09 cisaacstern

The only real problem with pattern_from_file_sequence is that it may contribute to the task-graph memory issues that we have struggled with in the past.

The way to resolve this would be to develop quantitative metrics and tests for those memory issues, rather than just waiting to hear that a particular recipe made the bakery crash. This was the point I was trying to make in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160#issuecomment-881614701. Without such metrics, we are flying blind.

rabernat avatar Sep 09 '21 00:09 rabernat