
Estimate monetary cost of executing plan

TomNicholas opened this issue 7 months ago · 3 comments

Cubed arguably has enough information to give a rough estimate of the monetary cost of executing the plan before starting execution.

I'm imagining a new method .estimate_cost(executor), similar to .compute(executor). Calling it, we would know:

  • how many arrays are to be processed, how big they all are, and what numpy functions are to be used to process them via the Plan object,
  • which serverless executor the functions are to be run on via the Executor passed,
  • the temporary intermediate bucket information via the Spec object.

It would just print an estimate of the cost back to the user without running anything, and maybe raise warnings if they are planning to do something that seems obviously expensive (e.g. having their temporary bucket for intermediate data in AWS but their executor on GCF).
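As a rough sketch of what such a method might return, here is one possible shape for the result object, including the cross-cloud warning described above. None of these names exist in Cubed today; CostEstimate, estimate_cost, and the cloud arguments are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class CostEstimate:
    """Hypothetical container for a pre-execution cost estimate."""
    compute_usd: float
    storage_usd: float
    warnings: list = field(default_factory=list)

    @property
    def total_usd(self) -> float:
        return self.compute_usd + self.storage_usd

def estimate_cost(executor_cloud, bucket_cloud, compute_usd, storage_usd):
    """Bundle an estimate and flag obviously expensive configurations."""
    est = CostEstimate(compute_usd, storage_usd)
    # Flag the expensive case from the text: intermediate bucket in one
    # cloud but the serverless executor in another.
    if executor_cloud != bucket_cloud:
        est.warnings.append(
            f"executor runs on {executor_cloud} but intermediate data lives "
            f"on {bucket_cloud}: cross-cloud egress charges will apply"
        )
    return est
```

The point of a dedicated result object (rather than just printing) is that callers could inspect the warnings programmatically, e.g. to abort a pipeline whose estimated cost exceeds a budget.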

This means that if we had a little table somewhere of e.g. AWS Lambda and S3 prices, Cubed could consult those numbers and sum them. It would also require an idea of e.g. how long it takes to run np.mean() on a chunk of a certain size on a certain container, but that seems like something that can be discovered fairly straightforwardly.
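The "little table of prices" plus summation could look something like the sketch below. The prices are illustrative only (they drift over time and vary by region), and the per-service breakdown here is an assumption, not anything Cubed currently tracks.

```python
# Illustrative price table -- real numbers should be refreshed from the
# provider pricing pages, and will differ by region.
PRICES_USD = {
    "aws_lambda_gb_second": 0.0000166667,  # per GB-second of compute
    "s3_put_per_1000": 0.005,              # per 1000 PUT requests
    "s3_get_per_1000": 0.0004,             # per 1000 GET requests
}

def plan_cost(num_tasks, seconds_per_task, memory_gb, puts, gets):
    """Sum the per-service costs for a plan's tasks and object-store requests."""
    compute = (
        num_tasks * seconds_per_task * memory_gb
        * PRICES_USD["aws_lambda_gb_second"]
    )
    requests = (
        (puts / 1000) * PRICES_USD["s3_put_per_1000"]
        + (gets / 1000) * PRICES_USD["s3_get_per_1000"]
    )
    return compute + requests
```

Since the plan already knows the number of chunks and their sizes, num_tasks and the request counts fall out of the plan directly; only seconds_per_task needs to be measured or looked up per numpy function.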

Obviously there is a long tail of cases where this wouldn't work, but often you might still be able to provide a lower-bound cost estimate. For example, if your plan had a step that applied some arbitrary function with apply_gufunc, Cubed would not know whether that was some super expensive function that would run forever, but it would still be possible to estimate the minimum cost by assuming that function is very light.
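The lower-bound idea could be sketched as follows: steps with a known per-chunk runtime contribute that runtime, while steps with an opaque user function contribute an assumed floor, so the total is "at least this much" rather than an upper bound. The floor value here is an arbitrary assumption for illustration.

```python
# Assumed floor on per-chunk runtime for an opaque user function
# (e.g. one passed to apply_gufunc); purely illustrative.
MIN_SECONDS_PER_CHUNK = 0.1

def lower_bound_seconds(steps):
    """steps: list of (num_chunks, seconds_per_chunk or None).

    None marks a step whose user function is unknown; it is charged
    at the assumed floor, giving a lower bound on total runtime.
    """
    total = 0.0
    for num_chunks, secs in steps:
        total += num_chunks * (secs if secs is not None else MIN_SECONDS_PER_CHUNK)
    return total
```

Multiplying this lower bound by a per-second compute price would give the minimum-cost estimate described above.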

TomNicholas avatar Dec 15 '23 16:12 TomNicholas