
Why can't I use my GPUs?

Open alanwilter opened this issue 2 years ago • 10 comments

Even with Docker, it's possible to run with GPUs, via https://github.com/NVIDIA/nvidia-docker.

I see that the requirements files don't include pytorch or tensorflow. Is that intentional?

I can't find any mention of GPUs in the documentation either.

Or am I missing something?

alanwilter avatar May 18 '22 12:05 alanwilter

@alanwilter which framework are you trying to use exactly? Detection and usage of GPUs will depend on the framework.

I see that the requirements files don't include pytorch or tensorflow. Is that intentional?

If you mean https://github.com/openml/automlbenchmark/blob/master/requirements.txt, those are the requirements for the amlb app itself, not for any of the frameworks that you can currently run by default. Some of those frameworks may require pytorch, for example, and in that case it's their responsibility to install it (via their frameworks/xxx/setup.sh) in their dedicated virtual env.

Even with Docker, it's possible to run with GPUs, via https://github.com/NVIDIA/nvidia-docker.

The Docker images we build by default don't include the NVIDIA drivers; installing them is something that would need to be done in the framework's setup.sh if it wants to leverage GPUs.
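
To make that concrete, here is a minimal sketch (purely illustrative: the framework name, package, and venv layout are placeholders, not the convention of any shipped integration) of the kind of thing a frameworks/xxx/setup.sh could do to pull a GPU-enabled dependency into that framework's dedicated environment:

#!/usr/bin/env bash
# hypothetical frameworks/xxx/setup.sh sketch: create the framework's own venv
# and install its dependencies there (a CUDA-enabled torch build, for example)
HERE=$(dirname "$0")
python3 -m venv "$HERE/venv"
"$HERE/venv/bin/pip" install --upgrade pip
"$HERE/venv/bin/pip" install torch   # placeholder dependency, GPU-enabled build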

sebhrusen avatar May 18 '22 12:05 sebhrusen

I'm just testing for the time being. Which framework do you suggest if I want to use/test with our GPUs?

As for Docker, we simply add --gpus N to the docker run ... command, assuming, of course, that the container has pytorch/tensorflow etc. installed. (One still needs to read the instructions and set things up accordingly.)

nvidia-docker is there precisely to save you from meddling with your Docker container.
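
For the record, a minimal sanity check along those lines, assuming the host already has the NVIDIA Container Toolkit installed (the CUDA image tag is only an example):

# verify that containers can see the GPUs before running any benchmark image
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

The same --gpus flag would then still have to be added to the docker run command that the benchmark generates, and the image itself would still need a CUDA-enabled framework installed.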

alanwilter avatar May 18 '22 13:05 alanwilter

Looking at the frameworks in detail, I only found:

  • frameworks/MLPlan : torch>=1.6.0,<1.7.0

Is that so?

alanwilter avatar May 18 '22 13:05 alanwilter

I tried to run it on the stable-v2 branch, both locally and in Docker; both failed with an error:

python3 runbenchmark.py MLPlanSKLearn -m docker
...
Download ML-Plan from extern

Successfully built docker image automlbenchmark/mlplan:stable-dev.

----------------------------------------------------------------
Starting job docker.test.test.all_tasks.all_folds.MLPlanSKLearn.
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] CPU Utilization: 77.1%
Starting docker: docker run --name mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__ --shm-size=2048M -v /mnt/data/awilter/cache/openml:/input -v /mnt/data/awilter/automlbenchmark/results/mlplansklearn.test.test.docker.20220518T183639:/output -v /home/awilter/.config/automlbenchmark:/custom --rm automlbenchmark/mlplan:stable-dev MLPlanSKLearn test test   -Xseed=auto -i /input -o /output -u /custom -s skip -Xrun_mode=docker --session=.
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] Memory Usage: 17.9%
Datasets are loaded by default from folder /mnt/data/awilter/cache/openml.
Generated files will be available in folder /mnt/data/awilter/automlbenchmark/results.
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] Disk Usage: 84.0%
Running cmd `docker run --name mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__ --shm-size=2048M -v /mnt/data/awilter/cache/openml:/input -v /mnt/data/awilter/automlbenchmark/results/mlplansklearn.test.test.docker.20220518T183639:/output -v /home/awilter/.config/automlbenchmark:/custom --rm automlbenchmark/mlplan:stable-dev MLPlanSKLearn test test   -Xseed=auto -i /input -o /output -u /custom -s skip -Xrun_mode=docker --session=`
Unable to find image 'automlbenchmark/mlplan:stable-dev' locally
docker: Error response from daemon: manifest for automlbenchmark/mlplan:stable-dev not found: manifest unknown: manifest unknown.
See 'docker run --help'.
Running cmd `docker kill mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__`
Error response from daemon: Cannot kill container: mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__: No such container: mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__

Job `docker.test.test.all_tasks.all_folds.MLPlanSKLearn` failed with error: Command 'docker run --name mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__ --shm-size=2048M -v /mnt/data/awilter/cache/openml:/input -v /mnt/data/awilter/automlbenchmark/results/mlplansklearn.test.test.docker.20220518T183639:/output -v /home/awilter/.config/automlbenchmark:/custom --rm automlbenchmark/mlplan:stable-dev MLPlanSKLearn test test   -Xseed=auto -i /input -o /output -u /custom -s skip -Xrun_mode=docker --session=' returned non-zero exit status 125.
Traceback (most recent call last):
  File "/mnt/data/awilter/automlbenchmark/amlb/job.py", line 115, in start
    result = self._run()
  File "/mnt/data/awilter/automlbenchmark/amlb/runners/container.py", line 108, in _run
    self._start_container("{framework} {benchmark} {constraint} {task_param} {folds_param} -Xseed={seed}".format(
  File "/mnt/data/awilter/automlbenchmark/amlb/runners/docker.py", line 73, in _start_container
    run_cmd(cmd, _capture_error_=False)  # console logs are written on stderr by default: not capturing allows live display
  File "/mnt/data/awilter/automlbenchmark/amlb/utils/process.py", line 245, in run_cmd
    raise e
  File "/mnt/data/awilter/automlbenchmark/amlb/utils/process.py", line 219, in run_cmd
    completed = run_subprocess(str_cmd if params.shell else full_cmd,
  File "/mnt/data/awilter/automlbenchmark/amlb/utils/process.py", line 77, in run_subprocess
    raise subprocess.CalledProcessError(retcode, process.args, output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'docker run --name mlplansklearn.test.test.docker.20220518T183639.zIBG03NxB4sbsopS82G7nA__ --shm-size=2048M -v /mnt/data/awilter/cache/openml:/input -v /mnt/data/awilter/automlbenchmark/results/mlplansklearn.test.test.docker.20220518T183639:/output -v /home/awilter/.config/automlbenchmark:/custom --rm automlbenchmark/mlplan:stable-dev MLPlanSKLearn test test   -Xseed=auto -i /input -o /output -u /custom -s skip -Xrun_mode=docker --session=' returned non-zero exit status 125.
All jobs executed in 1.646 seconds.
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] CPU Utilization: 65.3%
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] Memory Usage: 17.8%
[MONITORING] [docker.test.test.all_tasks.all_folds.MLPlanSKLearn] Disk Usage: 83.9%

Essentially, the docker image automlbenchmark/mlplan:stable-dev is not created.

alanwilter avatar May 18 '22 21:05 alanwilter

The reason MLPlanSKLearn is failing (along with anything based on ML-Plan) is that mlplan.org is down.

The file mlplan.zip is never downloaded. Perhaps this framework's setup needs an update. Perhaps something more here? https://mavenlibs.com/maven/dependency/ai.libs/mlplan-full

Anyway, all I wanted was an example framework that would use the GPU.

alanwilter avatar May 19 '22 07:05 alanwilter

@mwever does the installation script need an update, or will mlplan.org be back up?

PGijsbers avatar Jun 02 '22 08:06 PGijsbers

Oh, @fmohr is maintaining this server for mlplan.org. I will notify him about the outage, but it should be back up soon.

Regarding the use of GPUs: ML-Plan does not make use of GPU resources. Honestly speaking, I currently do not remember why we included the torch package at all; it must be some technical detail, but we only build pipelines with scikit-learn algorithms and xgboost - that's it. So I am sorry to disappoint you, @alanwilter, but ML-Plan is no reference for a framework using GPUs.

mwever avatar Jun 02 '22 10:06 mwever

Also, there are other packages that have GPU support (e.g., for xgboost), like AutoGluon or H2O (I think). Though I don't know how accessible their configurations are from the benchmark framework (forwarding configurations from the framework definitions has some limitations at the moment, which also depend on the respective framework integration).

PGijsbers avatar Jun 02 '22 11:06 PGijsbers

Sorry for the late reply (I was off for a month): @PGijsbers is right about H2O; it will detect GPUs and use them for xgboost. I don't know how it works with AutoGluon.

sebhrusen avatar Jun 21 '22 17:06 sebhrusen

AutoGluon will auto-detect and use the GPU only if hyperparameters='multimodal' is set, for multimodal text/tabular/image data. You can also force GPUs for other models like LightGBM, XGBoost, and CatBoost by following the tutorials at auto.gluon.ai, but that is not easy to do with how AMLB currently works.
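
For context, a hedged sketch of what that option looks like on the AutoGluon side (the tiny dataset and column names are made up for illustration; check the auto.gluon.ai tutorials for the exact API of your version):

import pandas as pd
from autogluon.tabular import TabularPredictor

# tiny made-up dataset just to show the call; real multimodal data would
# contain text/image columns large enough to benefit from a GPU
train_data = pd.DataFrame({
    "text": ["good product", "bad product", "great value", "terrible quality"],
    "label": [1, 0, 1, 0],
})

# hyperparameters='multimodal' is what makes AutoGluon consider GPU-backed
# models; plain tabular presets stay on CPU by default
predictor = TabularPredictor(label="label").fit(
    train_data,
    hyperparameters="multimodal",
)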

Innixma avatar Jul 27 '22 13:07 Innixma