
#+TITLE: Automatic Machine Learning Algorithm Configuration in R

* Introduction
~automlr~ is an R package for automatically configuring ~mlr~ machine learning algorithms so that they perform well. It is designed for ease of use and can run with minimal user intervention.

~automlr~ is currently under development. You can see the current status in the [[https://github.com/mlr-org/automlr/tree/develop]['develop' branch]], but that branch may or may not be functional.

* What does ~automlr~ offer?
~automlr~ complements ~mlr~ by making optimization over multiple learners and their respective hyperparameters possible. It offers:
  • A database of sensible hyperparameter search ranges for many ~mlr~ learners and preprocessing operators.
  • A mechanism for combining these learners and preprocessing operations into a single ~Learner~ object, and their hyperparameters into a single search space.
  • A unified interface to optimization algorithms, designed to make it easy to continue optimization runs.
* Installation
/Note: Installation of automlr is currently only tested on Linux systems. Installation on other systems, especially MS Windows, might not work./

Installation will fail if ~roxygen2~ is not installed, so make sure it is:
#+BEGIN_SRC R
if (!require("roxygen2")) install.packages("roxygen2")
#+END_SRC
~automlr~ furthermore depends on [[https://github.com/mlr-org/mlr/pull/1827][my CPO extension to mlr]], which must be installed from GitHub. Install my private branch, which is kept in a state compatible with ~automlr~:
#+BEGIN_SRC R
devtools::install_github("mb706/mlr", ref = "mb706_CPO")
#+END_SRC
Then you can install ~automlr~ from source. ~automlr~ can work with many ~mlr~ learners; however, many of them have package dependencies. ~automlr~ does not /need/ these packages and will skip learners whose packages are not installed, but without them the search space will be incomplete (and a warning will be given). To install ~automlr~ and all referenced learner packages (this can take a while!), do
#+BEGIN_SRC R
devtools::install_github("mlr-org/automlr", dependencies = c("Depends", "Imports", "Suggests"))
#+END_SRC
Alternatively, to install only ~automlr~ (and its essential dependencies), do
#+BEGIN_SRC R
devtools::install_github("mlr-org/automlr")
#+END_SRC

It is highly recommended to install my forks of the ~e1071~, ~kernlab~ and ~mda~ packages, which fix bugs that otherwise regularly lead to R crashing or hanging:
#+BEGIN_SRC R
devtools::install_github("mb706/e1071")
devtools::install_github("mb706/kernlab")
devtools::install_github("mb706/mda")
#+END_SRC

* Usage
To run a small example that fits some learners on the ~mlr~-provided ~pid.task~, execute
#+BEGIN_SRC R
library("mlr")
library("automlr")

# depending on your RNG luck, this can take tens of minutes
amrun = automlr(pid.task, backend = "random", budget = c(evals = 10), verbosity = 1)
result = amfinish(amrun)
print(result, verbose = TRUE)
#+END_SRC
This already shows all the mandatory arguments of ~automlr~: the task for which to optimize, the backend to use (may be "random", "irace" or "mbo"), and a computational budget. The resulting object can be given to another ~automlr~ call with a different budget to continue optimizing, or to ~amfinish~ to finalize the run.
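For example, the run above can be continued by handing the returned object back to ~automlr~ (a sketch; the budget value is arbitrary):
#+BEGIN_SRC R
# continue optimizing with a different budget, then finalize the run
amrun = automlr(amrun, budget = c(evals = 50))
result = amfinish(amrun)
print(result, verbose = TRUE)
#+END_SRC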

You can subset the search space like so:
#+BEGIN_SRC R
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
  searchspace = list(mlrLearners$classif.randomForest, mlrLearners$classif.svm))
#+END_SRC
or
#+BEGIN_SRC R
amrun2 = automlr(pid.task, backend = "random", budget = c(evals = 10),
  searchspace = mlrLearners[c("classif.randomForest", "classif.svm")])
#+END_SRC

The functions and data exported by ~automlr~ that will be of interest to the user:

  • ~automlr~ invocation
    • automlr :: The main entry point; can be called with a task and a backend, or with an object that was returned by a previous ~automlr~ invocation, or even with a file name that was used by ~automlr~ to save the state. The user can choose (a sketch combining these arguments follows after this list):
      • which backend to use (~backend~)
      • the computational budget (~budget~)
      • a possible savefile (~savefile~) and the interval in which to save to a file (~save.interval~)
      • a measure for which to optimize, if not the task's default measure (~measure~)
      • the verbosity level (~verbosity~) ranging from 0 to 6, 0 being the least verbose. Level 6 also stops the optimization process when a learner returns an error.
    • makeBackendconfRandom, makeBackendconfMbo, makeBackendconfIrace :: Make backend configuration objects to use instead of the backend strings ("random" etc.)
    • amfinish :: Generates an ~AMResult~ object that contains information about the optimization result.
    • mlrLearners, mlrLearnersNoWrappers :: A collection of mlr learners with corresponding search space. ~mlrLearnersNoWrappers~ does not contain preprocessing wrappers.
    • mlrLightweight, mlrLightweightNoWrappers :: Similar to ~mlrLearners~ and ~mlrLearnersNoWrappers~; these are search spaces, but with the slowest learners removed. This decreases evaluation time and is also necessary for the "mbo" backend to work.
  • searchspace definition
    • autolearner :: define your own mlr learner to put in a search space
    • autoWrapper :: define an mlr wrapper to use in a search space
    • sp :: for defining parameters that are given to ~autolearner~
See their [[https://github.com/mlr-org/automlr/raw/release/automlr-manual.pdf][respective R documentation]] for more information and additional arguments.
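A sketch combining several of the invocation arguments listed above; the file name and values are illustrative, and the assumption that ~save.interval~ is given in seconds should be checked against the R documentation:
#+BEGIN_SRC R
library("mlr")
library("automlr")

amrun = automlr(pid.task,
  backend = "random",
  budget = c(walltime = 3600),  # one hour of wall-clock time
  measure = acc,                # optimize accuracy instead of the task's default measure
  savefile = "amstate",         # snapshot the optimization state to this file...
  save.interval = 300,          # ...at regular intervals
  verbosity = 2)

# after a crash, a run can be resumed from the savefile
amrun = automlr("amstate", budget = c(walltime = 7200))
#+END_SRC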
* Troubleshooting
** Segfaults
Unfortunately, some learners, especially ones that use native code, may crash the whole R session. Also, a recent Linux kernel release apparently [[https://github.com/s-u/rJava/issues/110][caused problems with the rJava package]]. If you see segfaults happening, try the following:
  • Run ~export _JAVA_OPTIONS="-Xss2560k -Xmx2g";~ before running R; alternatively, run ~options(java.parameters = c("-Xss2560k", "-Xmx2g"))~ at the beginning of your R session. This may help even if the crash happens in a non-Java learner.
  • Use ~setDefaultRWTBackend("fork")~. This causes all learners to be run in a separate process. However, see the "fork" backend issue below.
  • Run ~automlr~ with a small value for ~save.interval~ and have a process in place to restart R from the savefile after a segfault.
** Timeout Overrun
The default "native" backend for interrupting learners that run over time cannot stop learners that spend a long time in native (C/Fortran) code routines. Use ~setDefaultRWTBackend("fork")~ to kill slow learners effectively, at the cost of some performance. However, see the following issue.
** setDefaultRWTBackend("fork") causes hangs
This happens if you use ~automlr~ with the "fork" backend and a learner uses Java. Currently, there is no way of using the "fork" backend with Java-based learners. Use the ~mlrLightweightNoJava~ search space to exclude all Java-based learners.
** Empty result when using "walltime" budget
If you are running ~automlr~ with a "walltime" budget, beware that a hard execution time limit is set to 10% of the walltime budget plus 10 minutes, after which the current ~irace~ or ~mlrMBO~ cycle is killed. To avoid this behaviour, set ~max.walltime.overrun~ to a larger value, possibly ~Inf~.
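A sketch of a call that disables this hard limit, assuming ~max.walltime.overrun~ is passed directly as an argument to ~automlr~:
#+BEGIN_SRC R
# one-hour walltime budget; never kill overrunning irace/mlrMBO cycles
amrun = automlr(pid.task, backend = "irace",
  budget = c(walltime = 3600),
  max.walltime.overrun = Inf)
#+END_SRC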

** Optimization Takes Too Long
Unfortunately, the runtime of different learners varies widely. To exclude the most problematic learners, use ~searchspace = mlrLightweight~ when calling ~automlr~.
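For example (a sketch, reusing the small budget from the usage example above):
#+BEGIN_SRC R
# exclude the slowest learners from the search space
amrun = automlr(pid.task, backend = "random",
  budget = c(evals = 10),
  searchspace = mlrLightweight)
#+END_SRC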

If a single evaluation is stuck in a loop and does not finish, this may be a bug in the learner. If you can provide useful information about such a bug, please open an "Issue" on GitHub. Gather this information using ~gdb~ or your debugger of choice (if you know your way around one); otherwise, try to find a way to reproduce the behaviour. I (and probably the learner package's developer) will be very happy to track down and fix these kinds of bugs.

** Maximal Number of DLLs reached
This happens because R is very conservative about how many DLLs it allows to be loaded. If you are using R >= 3.4, one solution is to set the environment variable ~R_MAX_NUM_DLLS~ to something greater than 100, as explained [[https://stackoverflow.com/a/43689526][here]]. Otherwise, reduce the number of learners in your search space.

If you do this, be aware that your ~ulimit -n~ setting might also need adjusting.
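Note that ~R_MAX_NUM_DLLS~ must be set before R starts, so it cannot be changed from within a running session. A sketch of the workflow (the value 150 is an arbitrary example):
#+BEGIN_SRC R
# put a line like the following into ~/.Renviron, which is read at R startup:
#   R_MAX_NUM_DLLS=150
# then, in a fresh R session, verify that the setting took effect:
Sys.getenv("R_MAX_NUM_DLLS")
#+END_SRC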

* Project status
The project is currently undergoing heavy development; while the spirit of the application is expected to remain stable, the user interface may undergo slight changes in the future. Expect the internals of ~automlr~ to change regularly.

* Notes

  • The "irace" backend's behaviour deviates slightly from that of the ~irace~ package in so far that the number of evaluations per generation, and the slimming of the sampling distribution, are independent of the budget.
  • The "mbo" backend currently uses an inferior imputation method for the surrogate model, and its performance should not be seen as representative for ~mlrMBO~.
  • For tasks with tens of features and thousands of rows, expect ~automlr~ to use about 0.5-2 MB of memory per row of data (e.g. roughly 2.5-10 GB for a 5000-row task).
* Project TODO (under consideration, subject to change)
  • [ ] release 0.3
    • [ ] integration of wrapper CPOs
  • [ ] release 0.4
    • [ ] nicer printing of results
    • [ ] consistent randomness
      • [ ] test that execution with same seed gets same result
      • [ ] use seeds in learners that use external RNGs
    • [ ] memory handling
    • [ ] searchspace
      • [ ] respect parameter equality IDs
      • [ ] automatically recognize absence of learner (in a hypothetical future mlr version) and don't throw an error
    • [ ] tests
      • [ ] 100% test coverage
      • [ ] test for all possible wrong arguments
      • [ ] other things?
    • [ ] regression learners
    • [ ] installation on Win32
    • [ ] more empirical grounding for mlrLightweight.
  • [ ] release 0.5
    • [ ] more sophisticated search space extensions
      • [ ] metalearner wrappers
  • [ ] release 0.6
    • [ ] cleaning up
      • [ ] Consistent solution for timeouts, the current one is not stable
      • [ ] Remove Ctrl-C handler, R does not work like this
    • [ ] CPOs
      • [ ] do CPO wrapping the correct way
      • [ ] use Meta-CPO
      • [ ] make CPO types etc. work together
  • [ ] release 1.0
    • [ ] everything is really, really stable
  • [ ] possible future releases
    • [ ] other backends?
    • [ ] simultaneous multiple task optimization
    • [ ] batchJobs integration? (e.g. break run down into smaller jobs automatically)
    • [ ] priors for learners?

** COMMENT also TODO
The [X] blocks still need testing.

  • [ ] experimental setup
  • [ ] check what is taking random so long with some evals
  • [ ] error imputation wrapper
    • [ ] make the imputation result wrapper work
    • [ ] get some way to communicate nature of error
  • [ ] check automlr option handling
  • [ ] mbo user.extras: add debug dump