Estimate recipe size

Open • rabernat opened this issue 3 years ago • 1 comment

It would be very useful to get an estimate of the total size of the target dataset produced by a recipe in GB / TB. For example, this information could be used by bakery managers to decide whether to accept a dataset into their storage.

Here are some different ways we could do this without actually running the whole recipe.

  1. Create a test version of the recipe (see #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).
  2. Go through each file in the recipe's FilePattern and inspect its size. Sum to get an estimated size. Only works for static file inputs (not APIs like OPeNDAP). May not accurately reflect target size if there is lots of processing involved.
  3. Randomly sample files from the FilePattern and scale up.
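
Options 2 and 3 could be sketched roughly as follows. This is only an illustration, not part of pangeo-forge-recipes: `estimate_input_bytes` and the `size_of` callable are hypothetical names, and the caller is assumed to supply the list of input paths from the recipe's FilePattern along with some way to look up the size of one input. The result estimates input size, which may differ from target size if there is lots of processing involved.

```python
import random
from typing import Callable, Optional, Sequence


def estimate_input_bytes(
    paths: Sequence[str],
    size_of: Callable[[str], int],
    sample_size: Optional[int] = None,
    seed: int = 0,
) -> float:
    """Estimate the total size in bytes of a recipe's input files.

    `paths` and `size_of` are hypothetical: the caller provides the input
    paths and a function returning the size in bytes of a single input.
    """
    if not paths:
        return 0.0
    if sample_size is not None and sample_size < 1:
        raise ValueError("sample_size must be at least 1")

    # Option 2: inspect every file and sum the sizes.
    if sample_size is None or sample_size >= len(paths):
        return float(sum(size_of(p) for p in paths))

    # Option 3: inspect a random sample and scale the mean size
    # up to the full number of inputs.
    rng = random.Random(seed)
    sample = rng.sample(list(paths), sample_size)
    mean_size = sum(size_of(p) for p in sample) / sample_size
    return mean_size * len(paths)
```

For local files, `size_of` could be `os.path.getsize`; remote inputs would need a different lookup, and inputs served through APIs like OPeNDAP may not expose a size at all.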

rabernat, May 13 '21 17:05

  1. Create a test version of the recipe (see #97) and examine the total size of the test target. Scale up based on the "pruning factor" (what fraction of the full data did the test dataset pull).

Are there known reasons why this is not the obvious best direction to pursue? It seems to dovetail nicely with other objectives, and should be relatively accurate, assuming the as-yet-unimplemented prune method referenced in https://github.com/pangeo-forge/staged-recipes/pull/28#issuecomment-829482555 is "prune factor"-aware.
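
For concreteness, the scale-up in option 1 might look something like the sketch below. The function name and arguments are hypothetical, and it assumes the pruning factor can be expressed as the fraction of inputs kept in the test recipe and that target size grows roughly linearly with the number of inputs.

```python
def scale_up_from_pruned(
    test_target_bytes: int,
    pruned_inputs: int,
    total_inputs: int,
) -> float:
    """Scale the size of a pruned test target up to the full recipe.

    Assumes (hypothetically) that the pruning factor is
    pruned_inputs / total_inputs and that target size is roughly
    proportional to the number of inputs.
    """
    if not 0 < pruned_inputs <= total_inputs:
        raise ValueError("need 0 < pruned_inputs <= total_inputs")
    # Dividing by the pruning factor, written so that the integer
    # product is computed before the division.
    return test_target_bytes * total_inputs / pruned_inputs


# Example: a test target of 2 MB built from 2 of 100 inputs.
estimate_gb = scale_up_from_pruned(2_000_000, 2, 100) / 1e9
```

The linearity assumption would break down if, for example, the first few inputs carry coordinate or metadata overhead that does not repeat across the rest.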

cisaacstern, May 13 '21 18:05