auto-sklearn
Error running "fit" with many cores.
Hi! I'm experiencing a problem when I fit an AutoSklearn instance in a virtual machine with many cores.
I have run exactly the same code, with the same dataset in three different virtual machines:
- in a VM with 4 cores and 15 GB of RAM: works ok ✅
- in a VM with 8 cores and 30 GB of RAM: works ok ✅
- in a VM with 40 cores and 157 GB of RAM: fails ❌ with the following error:
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap\n self.run()\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run\n self._target(*self._args, **self._kwargs)\n File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func\n return_value = ((func(*args, **kwargs), 0))\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator\n return ta(queue=queue, **kwargs)\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__\n threadpool_limits(limits=1)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 171, in __init__\n self._original_info = self._set_threadpool_limits()\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 280, in _set_threadpool_limits\n module.set_num_threads(num_threads)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 659, in set_num_threads\n return set_func(num_threads)\nKeyboardInterrupt\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
This is the code I was running:
automl = AutoSklearnClassifier(time_left_for_this_task=600, metric=roc_auc)
automl.fit(x_train, y_train, x_validation, y_validation)
Limiting the number of cores with the param nproc seems to work, but it's a pity that we cannot take advantage of larger infra :(
The dataset doesn't seem to be the problem. I reproduced the bug with datasets of different sizes and different feature types, and every time it raises the same error (it's deterministic, not something that happens stochastically).
Also, the error is almost instantaneous: clearly it doesn't even start to fit when it fails.
Environment and installation:
- OS: linux
- Python version: 3.7
- Auto-sklearn version: 0.13.0
The workaround I found to fix this issue is to limit the number of cores with the env var OPENBLAS_NUM_THREADS before importing anything from autosklearn.
For example:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "8"
from autosklearn(...)
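A slightly more defensive variant of this workaround, as a minimal sketch. The helper name `cap_openblas_threads` and the default cap of 8 are illustrative assumptions and not part of auto-sklearn. It only sets the environment variable, so it has to run before NumPy, scikit-learn or autosklearn are imported.

```python
import os


def cap_openblas_threads(max_threads=8):
    """Set OPENBLAS_NUM_THREADS to at most max_threads and return the value used."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    threads = max(1, min(max_threads, available))
    os.environ["OPENBLAS_NUM_THREADS"] = str(threads)
    return threads


# Call this first, then import autosklearn as usual.
cap_openblas_threads(8)
```

Capping at the number of available cores keeps the same script usable on both the small and the large VMs.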
Hello. I'm having a similar issue, and that solution does not work for me. I'm running auto-sklearn = "0.14.0" on a MacBook with 16 cores (not M1).
Hi @sofidenner,
We don't have the infrastructure (a machine with that many cores) to test this properly, which makes this difficult, but we want to say here that we are aware of the issue and are sorry that we have no response yet.
Hi @sofidenner and others,
can you make sure that the resources that you are providing for fit() are actually available?
I managed to work around this error by freeing up resources on my machine.
However, this could also be a coincidence because this error also occurs occasionally for me.
Hi @felidsche, in my case the resources are not the problem: having more than 150GB of RAM free, and running an experiment with an incredibly small dataset results in the same error, every time I run it.
This is the snippet with the incredibly small dataset that I just use to try it out:
import pandas
from autosklearn.estimators import AutoSklearnClassifier
train_x = pandas.DataFrame(
{"column1": [1, 2, 3, 10, 20, 30]}
)
train_y = [True, True, True, False, False, False]
validation_x = pandas.DataFrame(
{"column1": [10, 20, 30, 1, 2, 3]}
)
validation_y = [False, False, False, True, True, True]
automl = AutoSklearnClassifier()
automl.fit(train_x, train_y, validation_x, validation_y)
And this is the complete Traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_1006/2928451066.py in <module>
14
15 automl = AutoSklearnClassifier()
---> 16 automl.fit(train_x, train_y, validation_x, validation_y)
/usr/local/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name)
945 y_test=y_test,
946 feat_type=feat_type,
--> 947 dataset_name=dataset_name,
948 )
949
/usr/local/lib/python3.7/site-packages/autosklearn/estimators.py in fit(self, **kwargs)
338 if self.automl_ is None:
339 self.automl_ = self.build_automl()
--> 340 self.automl_.fit(load_models=self.load_models, **kwargs)
341
342 return self
/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models)
1662 only_return_configuration_space=only_return_configuration_space,
1663 load_models=load_models,
-> 1664 is_classification=True,
1665 )
1666
/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in fit(self, X, y, task, X_test, y_test, feat_type, dataset_name, only_return_configuration_space, load_models, is_classification)
640 # == Perform dummy predictions
641 # Dummy prediction always have num_run set to 1
--> 642 self.num_run += self._do_dummy_prediction(datamanager, num_run=1)
643
644 # == RUN ensemble builder
/usr/local/lib/python3.7/site-packages/autosklearn/automl.py in _do_dummy_prediction(self, datamanager, num_run)
422 raise ValueError(
423 "Dummy prediction failed with run state %s and additional output: %s."
--> 424 % (str(status), str(additional_info))
425 )
426 return num_run
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap\n self.run()\n File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run\n self._target(*self._args, **self._kwargs)\n File "/usr/local/lib/python3.7/site-packages/pynisher/limit_function_call.py", line 133, in subprocess_func\n return_value = ((func(*args, **kwargs), 0))\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/__init__.py", line 40, in fit_predict_try_except_decorator\n return ta(queue=queue, **kwargs)\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 1164, in eval_holdout\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/train_evaluator.py", line 194, in __init__\n budget_type=budget_type,\n File "/usr/local/lib/python3.7/site-packages/autosklearn/evaluation/abstract_evaluator.py", line 199, in __init__\n threadpool_limits(limits=1)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 354, in __init__\n super().__init__(ThreadpoolController(), limits=limits, user_api=user_api)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 159, in __init__\n self._set_threadpool_limits()\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 285, in _set_threadpool_limits\n lib_controller.set_num_threads(num_threads)\n File "/usr/local/lib/python3.7/site-packages/threadpoolctl.py", line 809, in set_num_threads\n return set_func(num_threads)\nKeyboardInterrupt\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
I have been getting this error as well on macOS Monterey 12.0 and auto-sklearn==0.13.0, and I have not updated any libraries in my environment before this error started showing up. It happens when calling fit regardless of parameters:
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "/Users/c91195a/Documents/experian/dragon/dragon/console.py", line 504, in train_console
train(
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/config.py", line 1069, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
raise proxy.with_traceback(exception.__traceback__) from None
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/gin/config.py", line 1046, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/Users/c91195a/Documents/experian/dragon/dragon/train.py", line 436, in train
experiment.run()
File "/Users/c91195a/Documents/experian/dragon/dragon/experiment/experiment.py", line 180, in run
self.__fit()
File "/Users/c91195a/Documents/experian/dragon/dragon/experiment/experiment.py", line 52, in __fit
self.ml_estimator.fit(
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/sklearn/pipeline.py", line 346, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/experimental/askl2.py", line 425, in fit
return super().fit(
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/estimators.py", line 941, in fit
super().fit(
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/estimators.py", line 340, in fit
self.automl_.fit(load_models=self.load_models, **kwargs)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 1655, in fit
return super().fit(
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 642, in fit
self.num_run += self._do_dummy_prediction(datamanager, num_run=1)
File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/autosklearn/automl.py", line 422, in _do_dummy_prediction
raise ValueError(
ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap\n self.run()\n File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 108, in run\n self._target(*self._args, **self._kwargs)\n File "/Users/c91195a/Library/Caches/pypoetry/virtualenvs/dragon-oQHUJD0o-py3.8/lib/python3.8/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
In call to configurable 'train' (<function train at 0x7f9cbcc5f8b0>)
Hey @raphaelTrench you are getting a different error message that is not due to the number of cores. Instead, the memory limit you provide is above what can be passed to MacOS. You may try with a lower memory limit to see whether this issue goes away, but we don't test OSX and therefore cannot make any guarantees about the behavior of Auto-sklearn under OSX.
Hey @mfeurer, I understand. I tried with a lower memory configuration and it still didn't go away. However, in case anyone faces this issue like me: the error stopped happening when I downgraded my macOS version to below 12.0 (Monterey).
I get the same error as in https://github.com/automl/auto-sklearn/issues/360#issuecomment-963293965 on macOS Monterey with an M1 Pro. The installation was successful and importing the package works. It seems to be related to this. I understand auto-sklearn is not tested on macOS, but I thought about reporting this known issue anyway in case someone finds a solution (one which does not require downgrading the OS).
Hey @erinaldi,
We recently started reworking pynisher, which is in charge of limiting resources for spawned processes. This error line comes directly from pynisher and is in the comment you linked:
resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit
We have another push on getting it to work tomorrow hopefully but we still need a solution for Windows before we can make a release on that.
If you'd like more context or have any solutions: we can use the built-in Python module resource for limiting memory on Unix-based systems, but there is no Windows equivalent; it's a Unix-only module. We need to find a substitute and then set up some local testing for it (we have no Windows machines). There are also other discrepancies between the three core operating systems.
The error above seems to happen regardless of the memory you provide for RLIMIT_XXX, and we think that RLIMIT_AS only works for Linux, or at least doesn't work on newer macOS systems.
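To make the failure mode concrete, here is a minimal sketch of setting RLIMIT_AS defensively with the standard-library `resource` module. This is not pynisher's implementation; the helper names `clamp_to_hard_limit` and `try_set_address_space_limit` are hypothetical. Unlike the call in the traceback, it leaves the hard limit untouched and reports failure instead of raising.

```python
import resource  # Unix-only standard-library module


def clamp_to_hard_limit(requested, hard):
    """Return a soft limit that does not exceed the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return requested  # no hard cap, any request is allowed
    if requested == resource.RLIM_INFINITY:
        return hard  # "unlimited" was requested but a hard cap exists
    return min(requested, hard)


def try_set_address_space_limit(mem_in_b):
    """Try to lower the soft RLIMIT_AS; return False if the OS refuses."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = clamp_to_hard_limit(mem_in_b, hard)
    try:
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    except (ValueError, OSError):
        # e.g. "current limit exceeds maximum limit", as in the traceback above
        return False
    return True
```

A caller could fall back to running without a memory limit (and log a warning) when the function returns False, rather than crashing the dummy prediction.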
If we can't get a windows version working soon, we will push the Mac fixed version as soon as we can and hopefully it will solve the issue for you :)
Best, Eddie
Thanks for the quick reply @eddiebergman
I would be happy to test it on macOS Monterey when you have a working PR. I did try different limits with no success like you said.
Right now this is not impacting my work since I am able to use a many-core Linux system, but I’ll check this thread for any future update.
Hi @erinaldi,
I've updated the current status of this issue in Pynisher if you're interested: automl/pynisher#16
So is this issue now resolved? :\
Is it?
it still happens to me...
still getting this issue on 0.15.0
full output:
`ValueError Traceback (most recent call last) Cell In[16], line 54 27 y_train = rolling_train['close_shift_15'] 29 ######### 30 ### Autogluon 31 ######### (...) 52 ### AutoSklearn 53 ######### ---> 54 rf = cls.fit(X_train, y_train) 56 # rf = RandomForestRegressor().fit(X, y) # Original code 57 58 # Predict on test_vals 59 X_test = test_vals.drop('close_shift_15', axis = 1)
File ~/opt/anaconda3/lib/python3.8/site-packages/autosklearn/estimators.py:1587, in AutoSklearnRegressor.fit(self, X, y, X_test, y_test, feat_type, dataset_name) 1576 raise ValueError( 1577 "Regression with data of type {} is " 1578 "not supported. Supported types are {}. " (...) 1582 "".format(target_type, supported_types) 1583 ) 1585 # Fit is supposed to be idempotent! 1586 # But not if we use share_mode. ... 488 self._logger.error(msg) --> 489 raise ValueError(msg) 491 return
ValueError: (' Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap\n self.run()\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/multiprocessing/process.py", line 108, in run\n self._target(*self._args, **self._kwargs)\n File "/Users/robertgrzesik/opt/anaconda3/lib/python3.8/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.',)`
The workaround I found to fix this issue is to limit the number of cores with the env var OPENBLAS_NUM_THREADS before importing anything from autosklearn.
For example:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "8"
from autosklearn(...)
@sofidenner It works fine from a Python file (.py file), but when I try to execute it through a Jupyter notebook it's still throwing the same error, even though I verified that the environment variable is set properly and I can print it using os.environ['OPENBLAS_NUM_THREADS'].
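One possible explanation, not confirmed in this thread: in a notebook kernel, NumPy or scikit-learn may already have been imported (by an earlier cell or a startup script) before the variable was set, in which case the BLAS library has probably already read its thread setting. The sketch below only checks for that condition; the helper name `already_imported` and the default list of module names are illustrative assumptions.

```python
import sys


def already_imported(names=("numpy", "scipy", "sklearn", "autosklearn")):
    """Return the names from `names` that are already loaded in this process."""
    return [name for name in names if name in sys.modules]


loaded = already_imported()
if loaded:
    # Setting OPENBLAS_NUM_THREADS now is likely too late for these modules.
    print("Already imported, restart the kernel first:", loaded)
```

If the list is non-empty, restarting the kernel and setting the variable in the very first cell is worth trying.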