
Third Party Components not shared with spawned child processes when n_jobs > 1

Open · AmirAlavi opened this issue 3 years ago · 1 comment

Describe the bug

When building extensions to auto-sklearn, one has to "register" them at runtime with the add_* component functions (for example, add_preprocessor, as used below).

I'm building a library that surrounds auto-sklearn with various extensions, and I provide the user with a script where they can specify which components should be "turned on".

When I use this script with multiprocessing (passing n_jobs > 1 to AutoSklearnClassifier), all of my runs crash because the worker processes don't see the custom components I added.
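
For reference, the registration itself boils down to a single runtime call into auto-sklearn's component registry; this is the same call used in configure_automl in the reproduction below:

from autosklearn.pipeline.components import data_preprocessing
from no_preprocessing import NoPreprocessing  # the custom component defined further down

# add_preprocessor() stores the component in an in-process registry held in module
# state; it is exactly this state that the spawned workers do not see when n_jobs > 1.
data_preprocessing.add_preprocessor(NoPreprocessing)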

To Reproduce

Below is a simplified scenario (see the .py files to reproduce at the end of this section), using Auto-sklearn's example code for extending data preprocessors. I added a boolean switch, PROTECTED_C, to show the two ways of writing this code.

If you run the script with n_jobs = 1, there are no issues.

However, if you run the script with n_jobs > 1, e.g. python main.py -n 2, what you observe depends on PROTECTED_C:

  • if PROTECTED_C = True: all auto-sklearn runs fail. The output is below; note how the avail_preprocessors: ... line printed by the worker processes is missing NoPreprocessing. Please see the comments in main.py below, but I think this is how one should structure driver scripts that use auto-sklearn, so the failed runs are unexpected behavior:
protected set-up code? True

avail_preprocessors: ['feature_type']
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:02,291:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:05,060:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,842:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:07,984:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']
[WARNING] [2022-11-09 23:05:08,072:Client-EnsembleBuilder] No runs were available to build an ensemble from
protected set-up code? True
avail_preprocessors: ['feature_type']

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

#########
CRASHED
ValueError("The provided component 'NoPreprocessing' for the key 'data_preprocessor' in the 'include' argument is not valid. The supported components for the step 'data_preprocessor' for this task are ['feature_type']")

auto-sklearn results:
  Dataset name: ef3cd4b2-6082-11ed-8f9c-0242ac110002
  Metric: accuracy
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 0
  Number of crashed target algorithm runs: 5
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0
  • if PROTECTED_C = False: no issues, but this relies on an unusually structured script. Output:
protected set-up code? False

parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
(__main__) avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']
protected set-up code? False
parsed args: Namespace(mode='kitchen-sink', n_jobs=2)
Configuring automl...
avail_preprocessors: ['feature_type', 'NoPreprocessing']

auto-sklearn results:
  Dataset name: 0a97408d-6083-11ed-9024-0242ac110002
  Metric: accuracy
  Best validation score: 0.943262
  Number of target algorithm runs: 5
  Number of successful target algorithm runs: 5
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

main.py:

PROTECTED_C = True
print(f"protected set-up code? {PROTECTED_C}")
print()

from smac.tae import StatusType

import autosklearn.classification
from autosklearn.pipeline.components.data_preprocessing import DataPreprocessorChoice

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import sklearn.metrics

import parse_args


def configure_automl(args):
    print("Configuring automl...")
    # ... "register" custom components with auto-sklearn, depending on args
    # For example, suppose a "simple" mode is requested, with no preprocessing
    from no_preprocessing import NoPreprocessing
    autosklearn.pipeline.components.data_preprocessing.add_preprocessor(NoPreprocessing)

if not PROTECTED_C:
    # This would be weird...it only makes sense to parse args if this module is being executed
    args = parse_args.parse_args()
    configure_automl(args)

avail_preprocessors = list(DataPreprocessorChoice.get_components())
print(f"avail_preprocessors: {avail_preprocessors}")

if __name__ == "__main__":
    if PROTECTED_C:
        # This is where I would expect to see the arg-parsing logic, since it's only relevant if this script is being executed
        args = parse_args.parse_args()
        configure_automl(args)
    
    avail_preprocessors = list(DataPreprocessorChoice.get_components())
    print(f"(__main__) avail_preprocessors: {avail_preprocessors}")
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    clf = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=120,
        include={"data_preprocessor": ["NoPreprocessing"]},
        # The two flags below are provided to speed up calculations
        # and are not recommended for a real implementation
        initial_configurations_via_metalearning=0,
        smac_scenario_args={"runcount_limit": 5},
        n_jobs=args.n_jobs
    )
    clf.fit(X_train, y_train)
    
    print()
    # Print out the error messages from crashed runs
    for run_key in clf.automl_.runhistory_.data:
        run_val = clf.automl_.runhistory_.data[run_key]
        if run_val.status == StatusType.CRASHED:
            print("#########")
            print("CRASHED")
            print(run_val.additional_info['error'])
            print()
            
    print(clf.sprint_statistics())

no_preprocessing.py:

from typing import Optional

from ConfigSpace.configuration_space import ConfigurationSpace

from autosklearn.askl_typing import FEAT_TYPE_TYPE
from autosklearn.pipeline.components.base import AutoSklearnPreprocessingAlgorithm
from autosklearn.pipeline.constants import SPARSE, DENSE, UNSIGNED_DATA, INPUT


class NoPreprocessing(AutoSklearnPreprocessingAlgorithm):
    def __init__(self, **kwargs):
        """This preprocessors does not change the data"""
        # Some internal checks makes sure parameters are set
        for key, val in kwargs.items():
            setattr(self, key, val)

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        return X

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            "shortname": "NoPreprocessing",
            "name": "NoPreprocessing",
            "handles_regression": True,
            "handles_classification": True,
            "handles_multiclass": True,
            "handles_multilabel": True,
            "handles_multioutput": True,
            "is_deterministic": True,
            "input": (SPARSE, DENSE, UNSIGNED_DATA),
            "output": (INPUT,),
        }

    @staticmethod
    def get_hyperparameter_search_space(
        feat_type: Optional[FEAT_TYPE_TYPE] = None, dataset_properties=None
    ):
        return ConfigurationSpace()  # Return an empty configuration space as there are no hyperparameters

parse_args.py:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Investigate issues with Third Party components and concurrency",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-n", "--n_jobs", type=int, default=1, help="The number of jobs to run in parallel for fit(). -1 means using all processors.")
    parser.add_argument("-m", "--mode", type=str, choices=["kitchen-sink", "very-simple", "interpretable-models"], default="kitchen-sink", help="Dictates what is included or not in the search space of auto-sklearn.")
    args = parser.parse_args()
    print(f"parsed args: {args}")
    return args

Environment and installation:

I'm running all of this using the auto-sklearn Docker image built off master, mentioned here.

  • OS == Linux (Docker)
  • Python version == 3.8.10
  • Auto-sklearn version == 0.15.0

Notes

I think this is because of how Python multiprocessing works with the "spawn" start method: spawned child processes re-import modules instead of inheriting the parent's in-memory state, so components registered at runtime in the parent (e.g. inside the if __name__ == "__main__": block) never get registered in the children.
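
A minimal sketch of that mechanism, independent of auto-sklearn, assuming a hypothetical REGISTRY set that stands in for the component registry:

import multiprocessing as mp

# Components known at import time; this mirrors what the registry looks like
# before any add_preprocessor() call.
REGISTRY = {"feature_type"}

def report(label):
    print(label, sorted(REGISTRY))

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    # Runtime registration, done only in the parent process (the PROTECTED_C = True layout).
    REGISTRY.add("NoPreprocessing")
    report("parent:")  # parent: ['NoPreprocessing', 'feature_type']
    p = mp.Process(target=report, args=("child:",))
    p.start()
    p.join()
    # Prints child: ['feature_type'] -- the spawned child re-imports this module and only
    # re-runs the module-level code, so the runtime registration is missing, just like
    # NoPreprocessing is missing from the workers' avail_preprocessors output above.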

Maybe these portions of the codebase are relevant to this:

https://github.com/automl/auto-sklearn/blob/5c69ddf4584c5c7c4977203a1a579d042c6e3716/autosklearn/evaluation/__init__.py#L388

https://github.com/automl/auto-sklearn/blob/a7f73f1563a25a74200692f615fa44b34a8a942c/autosklearn/evaluation/abstract_evaluator.py#L291

AmirAlavi · Nov 09 '22 23:11

Hi @AmirAlavi,

Sorry to see that this issue has come back; I'm aware there are some issues related to multiprocessing. I thought at one point this had been fixed, but maybe not, or maybe it regressed. As you can imagine, testing multiprocessing behavior like this can be a bit complicated, so I appreciate the scripts.

I'm currently working on updating auto-sklearn to the latest scikit-learn and our other core dependencies, but these scripts will be helpful when I get a chance to look at this!

Many thanks, Eddie

eddiebergman · Nov 15 '22 16:11