auto-sklearn
Best practices on running auto-sklearn on long time budgets
What are the recommended search space parameters for auto-sklearn to take maximum benefit from distributed training on multiple CPUs and long time budgets (one day or more)?
Context:
Most of the available examples concentrate on a local machine setup and time budgets defined in minutes rather than hours or days.
Let's assume we have a multi-CPU, RAM-generous HPC system available. What would be the recommended practices, search space, and auto-sklearn configuration parameters that researchers can explore to maximize auto-sklearn's performance?
Thank you
Hey, here are some suggestions:
- use cross-validation to get more reliable estimates of generalization performance
- increase the per_run_time_limit to one or two hours (but in a way that in total ~100-200 configurations could finish if they go up to the runtime limit); see the sketch after this list

I haven't recently tried such large-scale runs myself, but these are definitely possible, and we have for example used settings with 50 or more workers running for more than a day for the AutoML challenge.
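To make the suggestions above concrete, here is a minimal sketch of how such a long-budget run might be configured. The dataset, the 24-hour budget, the one-hour per-run limit, the worker count, and the memory limit are all illustrative assumptions, not recommendations from this thread, and the parameter names follow recent auto-sklearn versions.

```python
import sklearn.datasets
import sklearn.model_selection
import autosklearn.classification

# Illustrative data; replace with your own.
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.classification.AutoSklearnClassifier(
    # Total budget: 24 hours (illustrative for a "1 day plus" run).
    time_left_for_this_task=24 * 3600,
    # One hour per configuration, as suggested above; with 8 parallel
    # workers this allows roughly 8 * 24 ≈ 190 configurations even if
    # every single one runs up to the limit.
    per_run_time_limit=3600,
    # Cross-validation for more reliable generalization estimates.
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 5},
    # Number of parallel workers; set to the CPUs available on the node.
    n_jobs=8,
    # Per-job memory limit in MB; raise it on a RAM-generous HPC system.
    memory_limit=8192,
)
automl.fit(X_train, y_train)

# With resampling_strategy="cv", refit the models on the full training
# set before predicting.
automl.refit(X_train, y_train)
print(automl.score(X_test, y_test))
```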
This should be documented as it's likely a use case that users have.
Another question to consider: even if resources are generous, what can be done to ensure the most effective use of them?
As @mfeurer mentioned above, really the only thing you can do on the user side is to improve the estimation of the performance of a single algorithm, so that the models at the end are considered generally quite strong. This entails giving them more time via per_run_time_limit and improving the model performance estimation strategy, i.e. cross-validation with more folds, as in the sketch below.
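A minimal sketch of just these two settings, assuming a recent auto-sklearn version; the two-hour per-run limit and the 10-fold count are illustrative values, not prescriptions from the thread.

```python
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    per_run_time_limit=2 * 3600,                  # give each configuration more time
    resampling_strategy="cv",                     # better performance estimates ...
    resampling_strategy_arguments={"folds": 10},  # ... with more folds
)
```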