
#+TITLE: Automatic Machine Learning Algorithm Configuration in R

* Introduction
~automlr~ is an R package for automatically configuring ~mlr~ machine learning algorithms so that they perform well. It is designed for ease of use and can run with minimal user intervention.

~automlr~ is currently under development. You can see the current status in the [[https://github.com/mlr-org/automlr/tree/develop]['develop' branch]], but that branch may or may not be functional.

* What does ~automlr~ offer?
~automlr~ complements ~mlr~ by making optimization over multiple learners and their respective hyperparameters possible. It offers:
  • A database of sensible hyperparameter search ranges for many ~mlr~ learners and preprocessing operators.
  • A mechanism for combining these learners and preprocessing operations into a single ~Learner~ object, and their hyperparameters into a single search space.
  • A unified interface to optimization algorithms, designed to make it easy to continue optimization runs.
* Installation
/Note: Installation of automlr is currently only tested on Linux systems. Installation on other systems, especially MS Windows, might not work./

Installation will fail if ~roxygen2~ is not installed, so make sure it is:
#+BEGIN_SRC R
if (!require("roxygen2")) install.packages("roxygen2")
#+END_SRC
~automlr~ furthermore depends on [[https://github.com/mlr-org/mlr/pull/1827][my CPO extension to mlr]], which must be installed from GitHub. Install my private branch, which is kept in a state compatible with ~automlr~:
#+BEGIN_SRC R
devtools::install_github("mb706/mlr", ref = "mb706_CPO")
#+END_SRC
Then you can install ~automlr~ from source. ~automlr~ can work with many ~mlr~ learners; however, many of them have package dependencies. ~automlr~ does not /need/ these packages and will skip learners whose packages are not installed, but without them the search space will be incomplete (and a warning will be given). To install ~automlr~ and all referenced learner packages (this can take a while!), do
#+BEGIN_SRC R
devtools::install_github("mlr-org/automlr", dependencies = c("Depends", "Imports", "Suggests"))
#+END_SRC
Alternatively, to install only ~automlr~ (and its essential dependencies), do
#+BEGIN_SRC R
devtools::install_github("mlr-org/automlr")
#+END_SRC

It is highly recommended to install my forks of the ~e1071~, ~kernlab~ and ~mda~ packages, which fix bugs that otherwise regularly lead to R crashing or hanging:
#+BEGIN_SRC R
devtools::install_github("mb706/e1071")
devtools::install_github("mb706/kernlab")
devtools::install_github("mb706/mda")
#+END_SRC

* Usage
To run a small example that fits some learners on the ~mlr~-provided ~pid.task~, execute
#+BEGIN_SRC R
library("mlr")
library("automlr")

# depending on your RNG luck, this can take tens of minutes
amrun = automlr(pid.task, backend = "random", budget = c(evals = 10), verbosity = 1)
result = amfinish(amrun)
print(result, verbose = TRUE)
#+END_SRC
This already shows all the mandatory arguments of ~automlr~: the task for which to optimize, the backend to use (may be "random", "irace" or "mbo"), and a computational budget. The resulting object can be given to another ~automlr~ call with a different budget to continue optimizing, or to ~amfinish~ to finalize the run.
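For example, the run above can be continued by handing the returned object back to ~automlr~ (a sketch; the budget value is arbitrary):
#+BEGIN_SRC R
# continue optimizing with a different budget, then finalize the run
amrun = automlr(amrun, budget = c(evals = 50))
result = amfinish(amrun)
print(result, verbose = TRUE)
#+END_SRC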

You can subset the search space like so:
#+BEGIN_SRC R
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
  searchspace = list(mlrLearners$classif.randomForest, mlrLearners$classif.svm))
#+END_SRC
or
#+BEGIN_SRC R
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
  searchspace = mlrLearners[c("classif.randomForest", "classif.svm")])
#+END_SRC

The functions and data exported by ~automlr~ that will be of interest to the user:

  • ~automlr~ invocation
    • automlr :: The main entry point; can be called with a task and a backend, or with an object that was returned by a previous ~automlr~ invocation, or even with a file name that was used by ~automlr~ to save the state. The user can choose (a sketch combining these arguments follows after this list):
      • which backend to use (~backend~)
      • the computational budget (~budget~)
      • a possible savefile (~savefile~) and the interval in which to save to a file (~save.interval~)
      • a measure for which to optimize, if not the task's default measure (~measure~)
      • the verbosity level (~verbosity~) ranging from 0 to 6, 0 being the least verbose. Level 6 also stops the optimization process when a learner returns an error.
    • makeBackendconfRandom, makeBackendconfMbo, makeBackendconfIrace :: Make backend configuration objects to use instead of the backend strings ("random" etc.)
    • amfinish :: Generates an ~AMResult~ object that contains information about the optimization result.
    • mlrLearners, mlrLearnersNoWrappers :: A collection of mlr learners with corresponding search space. ~mlrLearnersNoWrappers~ does not contain preprocessing wrappers.
    • mlrLightweight, mlrLightweightNoWrappers :: Similar to ~mlrLearners~ and ~mlrLearnersNoWrappers~; these are search spaces, but with the slowest learners removed. This decreases evaluation time and is also necessary for the "mbo" backend to work.
  • searchspace definition
    • autolearner :: define your own mlr learner to put in a search space
    • autoWrapper :: define an mlr wrapper to use in a search space
    • sp :: for defining parameters that are given to ~autolearner~
See their [[https://github.com/mlr-org/automlr/raw/release/automlr-manual.pdf][respective R documentation]] for more information and additional arguments.
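A sketch combining several of the invocation arguments listed above; the file name and values are illustrative, and the assumption that ~save.interval~ is given in seconds should be checked against the R documentation:
#+BEGIN_SRC R
library("mlr")
library("automlr")

amrun = automlr(pid.task,
  backend = "random",
  budget = c(walltime = 3600),  # one hour of wall-clock time
  measure = acc,                # optimize accuracy instead of the task's default measure
  savefile = "amstate",         # snapshot the optimization state to this file...
  save.interval = 300,          # ...at regular intervals
  verbosity = 2)

# after a crash, a run can be resumed from the savefile
amrun = automlr("amstate", budget = c(walltime = 7200))
#+END_SRC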
* Troubleshooting
** Segfaults
Unfortunately, some learners, especially ones that use native code, may crash the whole R session. Also, a recent Linux kernel release apparently [[https://github.com/s-u/rJava/issues/110][caused problems with the rJava package]]. If you see segfaults happening, try the following:
  • Run ~export _JAVA_OPTIONS="-Xss2560k -Xmx2g";~ before running R; alternatively, run ~options(java.parameters = c("-Xss2560k", "-Xmx2g"))~ at the beginning of your R session. This may help even if the crash happens in a non-Java learner.
  • Use ~setDefaultRWTBackend("fork")~. This causes all learners to be run in a separate process. However, see the "fork" backend issue below.
  • Run ~automlr~ with a small value for ~save.interval~ and have a process in place to restart R from the savefile after a segfault.
** Timeout Overrun
The default "native" backend for interrupting learners that run over time cannot stop learners that spend a long time in native (C/Fortran) code routines. Use ~setDefaultRWTBackend("fork")~ to kill slow learners effectively, at the cost of some performance. However, see the following issue.
** setDefaultRWTBackend("fork") causes hangs
This happens if you use ~automlr~ with the "fork" backend and a learner uses Java. Currently, there is no way of using the "fork" backend with Java-based learners. Use the ~mlrLightweightNoJava~ search space to exclude all Java-based learners.
** Empty result when using "walltime" budget
If you are running ~automlr~ with a "walltime" budget, beware that a hard execution time limit is set to 10% of the walltime budget plus 10 minutes, after which the current ~irace~ or ~mlrMBO~ cycle is killed. To avoid this behaviour, set ~max.walltime.overrun~ to a larger value, possibly ~Inf~.
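A sketch of a call that disables this hard limit, assuming ~max.walltime.overrun~ is passed directly as an argument to ~automlr~:
#+BEGIN_SRC R
# one-hour walltime budget; never kill overrunning irace/mlrMBO cycles
amrun = automlr(pid.task, backend = "irace",
  budget = c(walltime = 3600),
  max.walltime.overrun = Inf)
#+END_SRC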

** Optimization Takes Too Long
Unfortunately, the runtime of different learners varies widely. To exclude the most problematic learners, use ~searchspace = mlrLightweight~ when calling ~automlr~.
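For example (a sketch, reusing the small budget from the usage example above):
#+BEGIN_SRC R
# exclude the slowest learners from the search space
amrun = automlr(pid.task, backend = "random",
  budget = c(evals = 10),
  searchspace = mlrLightweight)
#+END_SRC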

If a single evaluation is stuck in a loop and does not finish, this may be a bug in the learner. If you can provide useful information about such a bug, please open an "Issue" on GitHub. Gather this information using ~gdb~ or your debugger of choice (if you know your way around one); otherwise, try to find a way to reproduce the behaviour. I (and probably the learner package's developer) will be very happy to track down and fix these kinds of bugs.

** Maximal Number of DLLs reached
This happens because R is very conservative about how many DLLs it allows to be loaded. If you are using R >= 3.4, one solution is to set the environment variable ~R_MAX_NUM_DLLS~ to something greater than 100, as explained [[https://stackoverflow.com/a/43689526][here]]. Otherwise, reduce the number of learners in your search space.

If you do this, be aware that your ~ulimit -n~ setting might also need adjusting.
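Note that ~R_MAX_NUM_DLLS~ must be set before R starts, so it cannot be changed from within a running session. A sketch of the workflow (the value 150 is an arbitrary example):
#+BEGIN_SRC R
# put a line like the following into ~/.Renviron, which is read at R startup:
#   R_MAX_NUM_DLLS=150
# then, in a fresh R session, verify that the setting took effect:
Sys.getenv("R_MAX_NUM_DLLS")
#+END_SRC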

* Project status
The project is currently undergoing heavy development; while the spirit of the application is expected to remain stable, the user interface may undergo slight changes in the future. Expect the internals of ~automlr~ to change regularly.

* Notes

  • The "irace" backend's behaviour deviates slightly from that of the ~irace~ package in so far that the number of evaluations per generation, and the slimming of the sampling distribution, are independent of the budget.
  • The "mbo" backend currently uses an inferior imputation method for the surrogate model, and its performance should not be seen as representative for ~mlrMBO~.
  • For tasks with tens of features and thousands of rows, expect ~automlr~ to use about 0.5-2 MB of memory per row of data (e.g. roughly 2.5-10 GB for a 5000-row task).
* Project TODO (under consideration, subject to change)
  • [ ] release 0.3
    • [ ] integration of wrapper CPOs
  • [ ] release 0.4
    • [ ] nicer printing of results
    • [ ] consistent randomness
      • [ ] test that execution with same seed gets same result
      • [ ] use seeds in learners that use external RNGs
    • [ ] memory handling
    • [ ] searchspace
      • [ ] respect parameter equality IDs
      • [ ] automatically recognize absence of learner (in a hypothetical future mlr version) and don't throw an error
    • [ ] tests
      • [ ] 100% test coverage
      • [ ] test for all possible wrong arguments
      • [ ] other things?
    • [ ] regression learners
    • [ ] installation on Win32
    • [ ] more empirical grounding for mlrLightweight.
  • [ ] release 0.5
    • [ ] more sophisticated search space extensions
      • [ ] metalearner wrappers
  • [ ] release 0.6
    • [ ] cleaning up
      • [ ] Consistent solution for timeouts, the current one is not stable
      • [ ] Remove Ctrl-C handler, R does not work like this
    • [ ] CPOs
      • [ ] do CPO wrapping the correct way
      • [ ] use Meta-CPO
      • [ ] make CPO types etc. work together
  • [ ] release 1.0
    • [ ] everything is really, really stable
  • [ ] possible future releases
    • [ ] other backends?
    • [ ] simultaneous multiple task optimization
    • [ ] batchJobs integration? (e.g. break run down into smaller jobs automatically)
    • [ ] priors for learners?

** COMMENT also TODO
The [X] blocks still need testing.

  • [ ] experimental setup
  • [ ] check what is taking random so long with some evals
  • [ ] error imputation wrapper
    • [ ] make the imputation result wrapper work
    • [ ] get some way to communicate nature of error
  • [ ] check automlr option handling
  • [ ] mbo user.extras: add debug dump