
Can Autosklearn handle Multi-Class/Multi-Label Classification and which classifiers will it use?

Open asmgx opened this issue 3 years ago • 8 comments

I have been trying to use AutoSklearn with Multi-class classification

so my labels are like this

0  1  2  3  4  ...  200
1  0  1  1  1  ...  1
0  1  0  0  1  ...  0
1  0  0  1  0  ...  0
1  1  0  1  0  ...  1
0  1  1  0  1  ...  0
1  1  1  0  0  ...  1
1  0  1  0  1  ...  0

I used this code

y = y[:, (65, 67, 54, 133, 122, 63, 102, 105, 39)]
X = df.drop(Code, axis=1, errors='ignore')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

automl = autosklearn.classification.AutoSklearnClassifier(
    include={'feature_preprocessor': ["no_preprocessing"]},
    exclude={'classifier': ['random_forest']},
    time_left_for_this_task=60 * 5,
    per_run_time_limit=60 * 1,
    memory_limit=1024 * 10,
    n_jobs=-1,
    metric=autosklearn.metrics.f1_macro,
)

But now I want to train Autosklearn on Multi-class Multi-label classification.

Which of these methods should I use?

1-

clf = OneVsRestClassifier(automl, n_jobs=-1)
clf.fit(X_train, y_train)

2-


clf = automl
clf.fit(X_train, y_train)

3-

I have to loop one class at a time and use

clf = automl
clf.fit(X_train, y_train)

so it will be like

for i in (65, 67, 54, 133, 122, 63, 102, 105, 39):
    y = z[:, i]
    X = df.drop(Code, axis=1, errors='ignore')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    automl = autosklearn.classification.AutoSklearnClassifier(
        include={'feature_preprocessor': ["no_preprocessing"]},
        exclude={'classifier': ['random_forest']},
        time_left_for_this_task=60 * 5,
        per_run_time_limit=60 * 1,
        memory_limit=1024 * 10,
        n_jobs=1,
        metric=autosklearn.metrics.f1_macro,
    )

    clf = automl
    clf.fit(X_train, y_train)

so I get a different model for each label?

asmgx avatar Mar 25 '22 00:03 asmgx

Hey again @asmgx,

Just as a note, the example you give at first is multi-label as there are multiple label columns, and not just one.

Method 2 will not work as we do not natively support multi-class multi-label classification. This is because sklearn models usually don't support this natively and require adapters, similar to the one you show in option 1. However, option 1 will also not work; read its description carefully: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn-multiclass-onevsrestclassifier. It supports one or the other but not both simultaneously.

In general, I don't think support for multi-class multi-label is very widespread, and I would advise reframing the problem as you suggest in option 3. One option, as you suggest, is to fit one classifier per multi-class target column and combine their results at the end. Another option is to one-hot encode each multi-class target column into multiple binary ones. Just as you can one-hot encode categorical feature columns, you can do the same to target columns that contain multiple classes, repeating this for each column in your output. This can increase the number of target columns dramatically depending on the number of classes, and it also makes translating between your original targets and the one-hot encoded variant more difficult to implement.
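As a rough illustration of the one-hot option, here is a minimal sketch in plain Python. The helper names `encode_targets` and `decode_targets` are hypothetical and not part of auto-sklearn or scikit-learn. It expands each multi-class target column into one binary column per class and maps the binary blocks back again.

```python
def encode_targets(y, classes_per_column):
    """Expand each multi-class column into one binary column per class."""
    encoded = []
    for row in y:
        bits = []
        for value, classes in zip(row, classes_per_column):
            bits.extend(1 if value == c else 0 for c in classes)
        encoded.append(bits)
    return encoded


def decode_targets(encoded, classes_per_column):
    """Map each binary block back to a single class per original column.

    The highest-scoring entry of each block wins, so this also accepts
    predicted probabilities, where a block may hold zero or several 1s.
    """
    decoded = []
    for bits in encoded:
        row, start = [], 0
        for classes in classes_per_column:
            block = bits[start:start + len(classes)]
            row.append(classes[block.index(max(block))])
            start += len(classes)
        decoded.append(row)
    return decoded


# Two target columns, each with the classes 0, 1 and 2.
y = [[1, 2], [0, 2], [0, 0], [2, 1]]
classes_per_column = [sorted({row[j] for row in y}) for j in range(len(y[0]))]

y_binary = encode_targets(y, classes_per_column)   # 2 columns become 6
y_back = decode_targets(y_binary, classes_per_column)
```

Note that a multilabel model trained on `y_binary` is not forced to predict exactly one class per block, which is why the decoder picks the highest-scoring entry instead of expecting a single 1.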

But to reiterate, we don't support it natively and implementation is left to the user.

Best, Eddie

eddiebergman avatar Mar 25 '22 11:03 eddiebergman

Hello to all,

For my undergraduate thesis, I am trying to benchmark some AutoML tools. Specifically, I am trying to plot ROC curves and calculate the area under the ROC curve for multiclass (not multilabel) classification on some datasets from OpenML-CC18 using Autosklearn. Basically, I am trying to implement this using AutoSklearnClassifier.

As eddiebergman already correctly pointed out, the clf = OneVsRestClassifier(automl, n_jobs=-1); clf.fit(X_train, y_train) bit can't be used directly.

Can you please provide an example of how this can be done?

Thanks in advance!

vgargan2 avatar Apr 10 '22 16:04 vgargan2

Hi @vgargan2,

We support regular multi-class classification out of the box. I realize we don't have an example to show this, but we regularly test on the benchmark openml/s/218, which is similar in spirit to OpenML-CC18.

In case this thread begins to confuse other readers, I'm going to lay out the four cases and clarify which we support.
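For the ROC question, one possible approach is one-vs-rest: treat each class in turn as the positive class and score it with the matching column of the predicted probabilities. The sketch below is plain Python with hypothetical helper names. The hand-written `proba` stands in for the output of `automl.predict_proba(X_test)`, and it assumes one column per class in sorted label order. In practice scikit-learn's `roc_curve` and `roc_auc_score` would do this work.

```python
def binary_auc(y_true, scores):
    """AUC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, classes):
    """Compute one AUC per class, treating that class as the positive label."""
    result = {}
    for k, c in enumerate(classes):
        is_c = [1 if t == c else 0 for t in y_true]
        column = [row[k] for row in proba]
        result[c] = binary_auc(is_c, column)
    return result


# Stand-in for y_test and automl.predict_proba(X_test) (assumed layout).
y_true = [0, 1, 2, 1, 0, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.6, 0.25, 0.15],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
classes = sorted(set(y_true))

aucs = one_vs_rest_auc(y_true, proba, classes)
macro_auc = sum(aucs.values()) / len(aucs)
```

The per-class values give one ROC summary per label, and `macro_auc` is their unweighted mean.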

  • Binary Classification - Supported | e.g. [0, 1, 1, 0, 0, 1]
  • Multiclass Classification - Supported | e.g. [0, 2, 3, 1, 3, 3, 2, 1, 0]
  • Multilabel Classification - Supported | e.g. [[0, 1, 0], [1, 1, 0], [1, 0, 0]]
  • Multilabel Multiclass Classification - Not Supported | e.g. [[1, 2, 0], [2, 1, 0], [3, 2, 1]]
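To check which of the four cases a target falls into, a small sketch like the following may help. `target_kind` is a hypothetical helper written for this thread, not an auto-sklearn function. scikit-learn ships a more thorough check, `sklearn.utils.multiclass.type_of_target`, which reports these cases as 'binary', 'multiclass', 'multilabel-indicator' and 'multiclass-multioutput'.

```python
def target_kind(y):
    """Classify a target by its shape and the values it contains."""
    if not isinstance(y[0], (list, tuple)):
        # One value per sample: binary or multiclass.
        return "binary" if len(set(y)) <= 2 else "multiclass"
    # Several values per sample: check whether every column is 0/1.
    values = {v for row in y for v in row}
    if values <= {0, 1}:
        return "multilabel"
    return "multiclass-multilabel"


kinds = [
    target_kind([0, 1, 1, 0, 0, 1]),
    target_kind([0, 2, 3, 1, 3, 3, 2, 1, 0]),
    target_kind([[0, 1, 0], [1, 1, 0], [1, 0, 0]]),
    target_kind([[1, 2, 0], [2, 1, 0], [3, 2, 1]]),
]
```

Only the last kind, where several target columns each hold more than two classes, is the unsupported case.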

Best, Eddie

eddiebergman avatar Apr 11 '22 05:04 eddiebergman

@eddiebergman this is confusing. You are saying that multilabel classification is supported, which is the same as the example I gave at the beginning of this post.

Do you mean that a dataset with target values like the following is supported?

RowNo  Feature1  Feature2  Feature3  |  Label1  Label2  Label3  Label4  Label5
------------------------------------------------------------------------------
1      73        84        34        |  0       1       1       0       1
2      37        88        84        |  0       0       0       1       1
3      93        90        58        |  1       0       1       1       0
4      77        44        66        |  1       1       1       0       0
5      48        82        38        |  1       1       0       1       1
6      53        87        42        |  0       1       0       0       0
7      80        55        28        |  1       0       0       1       0
8      66        74        97        |  0       0       1       1       1

Can you advise how we can work with this example?

asmgx avatar Apr 11 '22 13:04 asmgx

@asmgx, I apologise, I misread your example in the very first section. Yes, it would support that example, which is multilabel. Nothing needs to be done to support it; autosklearn will work out of the box with those labels: automl = AutoSklearnClassifier(); automl.fit(X, y)

I read the column headers as being non-binary and assumed you meant multiclass-multilabel classification, especially given the title of the issue.

This whole issue shows that we should have a clear section about this. I also sometimes mix up multiclass and multilabel, and I don't expect everyone to know that you can combine the two to get the entirely different multiclass-multilabel, which sklearn has limited support for.

For those scrolling to the bottom of the issue

import numpy as np
from autosklearn.classification import AutoSklearnClassifier

# Nothing has to be done for multi-label OR multi-class
X = np.random.rand(4, 2)  # 4 examples, 2 features


# For binary
binary_y = [1, 0, 1, 1]
automl = AutoSklearnClassifier()
automl.fit(X, binary_y)

# For multiclass
multiclass_y = [1, 2, 0, 2]
automl = AutoSklearnClassifier()
automl.fit(X, multiclass_y)

# For multilabel
multilabel_y = [[1, 0], [0, 0], [1, 1], [1, 0]]
automl = AutoSklearnClassifier()
automl.fit(X, multilabel_y)

# For multiclass-multilabel y
# NOT SUPPORTED
multiclass_multilabel_y = [[1, 2], [0, 2], [0, 0], [2, 1]]

eddiebergman avatar Apr 11 '22 14:04 eddiebergman

@eddiebergman Thanks. Is there more documentation on how AutoSklearn supports multi-label datasets? How does it build its models? I know that not all algorithms support multi-label natively, so does it use OneVsRestClassifier internally, or does it loop over all the labels?

Is there any documentation on this?

asmgx avatar Apr 11 '22 14:04 asmgx

Nothing special is done. When doing multi-label classification, we only consider models that natively support multilabel classification.

https://github.com/automl/auto-sklearn/blob/6cc8bb179fcb023d1c341cf33d2958a16a6935be/autosklearn/pipeline/components/classification/__init__.py#L68
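The filtering idea can be sketched roughly as follows. This is not the library's actual code: the registry, the classifier names and the exact property keys are assumptions made for illustration, loosely modelled on per-component capability flags such as a `handles_multilabel` entry.

```python
# Hypothetical registry: names and flags are illustrative only.
COMPONENTS = {
    "clf_a": {"handles_multiclass": True, "handles_multilabel": True},
    "clf_b": {"handles_multiclass": True, "handles_multilabel": False},
    "clf_c": {"handles_multiclass": False, "handles_multilabel": False},
}


def available_classifiers(components, dataset_properties):
    """Keep only classifiers whose flags cover the dataset's target type."""
    selected = []
    for name, props in components.items():
        if dataset_properties.get("multilabel") and not props["handles_multilabel"]:
            continue  # no wrapper is added; the model is simply skipped
        if dataset_properties.get("multiclass") and not props["handles_multiclass"]:
            continue
        selected.append(name)
    return selected


multilabel_choices = available_classifiers(COMPONENTS, {"multilabel": True})
multiclass_choices = available_classifiers(COMPONENTS, {"multiclass": True})
```

Under this assumed design, a multilabel dataset shrinks the search space instead of wrapping every model in something like OneVsRestClassifier.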

There's no documentation covering this, but there probably should be.

eddiebergman avatar Apr 11 '22 19:04 eddiebergman

We document the supported tasks here, but we should potentially rename this to "supported target types" and link to scikit-learn's glossary; for example, for multi-label we should link to https://scikit-learn.org/stable/glossary.html#term-multilabel. Indeed, we have no documentation on which classifier is used for which target type, and it would be great to have that.

mfeurer avatar Apr 19 '22 08:04 mfeurer