
Best practices for running auto-sklearn on long time budgets

Open Oli2 opened this issue 4 years ago • 5 comments

What are the recommended search space parameters for auto-sklearn to take maximum benefit from distributed training on multiple CPUs and long time budgets (1 day plus)?

Context:

Most of the available examples concentrate on local-machine setups and time budgets defined in minutes rather than hours or days.

Let's assume we have a multi-CPU, RAM-generous HPC system available. What are the recommended practices, search space, and auto-sklearn configuration parameters that researchers can explore to maximize auto-sklearn's performance?

Thank you

Oli2 avatar Apr 28 '20 12:04 Oli2

Hey, here are some suggestions:

  • use cross-validation to get more reliable estimates of generalization performance
  • increase the per_run_time_limit to one or two hours, but such that in total ~100-200 configurations can still finish even if each one runs up to the limit (see the sketch below)

I haven't recently tried such large-scale runs myself, but they are definitely possible; for example, we used settings with 50 or more workers running for more than a day for the AutoML challenge.
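To make this concrete, here is a minimal sketch of such a configuration, not an official recipe: the dataset, budgets, fold count, and worker count are placeholders, and the parameter names follow recent auto-sklearn releases. With 8 workers, a 24-hour total budget, and a 1-hour per-run limit, up to roughly 190 configurations can finish even if each exhausts its limit.

```python
import autosklearn.classification
from sklearn.datasets import load_digits

# Placeholder data -- on an HPC system this would be your own dataset.
X, y = load_digits(return_X_y=True)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=24 * 3600,          # total budget: 24 hours
    per_run_time_limit=3600,                    # up to 1 hour per configuration
    resampling_strategy='cv',                   # more reliable performance estimates
    resampling_strategy_arguments={'folds': 5}, # placeholder fold count
    n_jobs=8,                                   # parallel workers on one node
    memory_limit=16384,                         # MB per worker; size to your RAM
)
automl.fit(X, y)
```

For multi-node HPC runs, recent auto-sklearn versions also accept a dask_client argument, so the workers can be spread across a Dask cluster instead of a single node.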

mfeurer avatar Jun 09 '20 11:06 mfeurer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

github-actions[bot] avatar May 05 '21 01:05 github-actions[bot]

This should be documented as it's likely a use case that users have.

eddiebergman avatar Sep 03 '21 11:09 eddiebergman

Another question to consider: even if resources are generous, what can be done to ensure the most effective use of them?

BradKML avatar Oct 19 '22 10:10 BradKML

As @mfeurer mentioned above, really the only thing you can do on the user side is to improve the estimation of the performance of a single algorithm, so that the models at the end are generally quite strong. This means giving them more time with per_run_time_limit and improving the model-performance estimation strategy, i.e., cross-validation with more folds. A short sketch follows below.
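As an illustration, here is a minimal sketch assuming a recent auto-sklearn release; the dataset and budgets are placeholders. With resampling_strategy='cv', the models are fit on individual folds during fit(), so refit() is used to retrain the final ensemble on the full training set before predicting.

```python
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=7200,                # placeholder 2-hour budget
    per_run_time_limit=600,                      # placeholder per-run limit
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 10}, # more folds -> better estimates
)
automl.fit(X_train, y_train)

# During fit() the models were trained on individual folds; refit() retrains
# the final ensemble members on the full training set before predicting.
automl.refit(X_train, y_train)
print(automl.score(X_test, y_test))
```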

eddiebergman avatar Oct 20 '22 07:10 eddiebergman