FLAML [FLAML Crash] [Classification] ValueError: Categorical categories must be unique

Hey, thanks for the great system.

I am experiencing a crash with a specific dataset. I get the following error when I fit FLAML on the Higgs dataset, a binary classification dataset used in the FLAML paper:

  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
    self._search()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
    self._search_sequential()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
    use_ray=False,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
    result = training_function(trial_to_run.config)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
    self.fit_kwargs,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 323, in compute_estimator
    log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
    estimator.fit(X_train, y_train, budget, **fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 731, in fit
    X_train = self._preprocess(X_train)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 689, in _preprocess
    lambda x:
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/frame.py", line 8740, in apply
    return op.apply()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 688, in apply
    return self.apply_standard()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 692, in <lambda>
    for c in x.cat.categories]))
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/accessor.py", line 93, in f
    return self._delegate_method(name, *args, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2631, in _delegate_method
    res = method(*args, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 1053, in rename_categories
    cat.categories = new_categories
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 733, in categories
    new_dtype = CategoricalDtype(categories, ordered=self.ordered)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 183, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 337, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 540, in validate_categories
    raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
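
The last frames show pandas rejecting the new category labels passed to rename_categories because two of them are identical. A minimal sketch of how that check can trip, assuming a categorical column whose categories include both the integer 1 and the string "1", and a rename that converts every label to a string. The column contents and the str() conversion are illustrative assumptions, not a confirmed description of what flaml/model.py does:

```python
import pandas as pd

# Hypothetical mixed-type column: the integer 1 and the string "1"
# are two distinct categories.
col = pd.Series([1, "1", 2], dtype="category")

try:
    # Stringifying the labels maps both 1 and "1" to "1",
    # so the new category list contains a duplicate.
    col.cat.rename_categories([str(c) for c in col.cat.categories])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```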

Here is my script:

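import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

# Load the Higgs data and split features from the 'class' target column
df = pd.read_csv('higgs.csv')
X, y = df.drop('class', axis=1), df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

automl_model = AutoML()
automl_model.fit(X_train, y_train, task='classification',
                 time_budget=300,
                 retrain_full='budget',
                 verbose=0, metric='macro_f1')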

Your feedback is appreciated.

May 13 '22 07:05 mossadhelali

Could you share the .csv file? A few lines are enough as long as this error can be reproduced. Also, could you let me know the flaml version?

May 13 '22 14:05 sonichi

Thanks for your reply, @sonichi. Please find the .csv file of the higgs dataset. I am using FLAML v0.6.3.

May 15 '22 03:05 mossadhelali

Thanks. I received a warning when reading the csv:

sys:1: DtypeWarning: Columns (20,21,22,23,24,25,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.

Then I found that the last row contains "?" values. After I removed them, the warning went away and I no longer got the error.
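
For reference, pandas can also treat "?" as missing at load time through the na_values parameter of read_csv, without editing the file. A minimal sketch, assuming "?" never stands for a legitimate value in the data; the three-row inline CSV is made up for illustration and is not the higgs file:

```python
import io

import pandas as pd

# Made-up sample: the last row uses "?" as a missing-value marker.
csv_text = "f1,f2,class\n0.5,1.2,1\n0.7,0.3,0\n?,?,1\n"

# Default parsing: the "?" forces f1 and f2 to be read as text.
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring "?" as a missing-value marker keeps the columns numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes)
print(clean.dtypes)
```

With the real file, the same na_values="?" argument would go into the pd.read_csv('higgs.csv') call from the script above.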

May 15 '22 18:05 sonichi

Thanks for your reply, @sonichi. This worked for the higgs dataset, but I imagine the error could appear again with other datasets. Are there any plans to fix this in future releases?

May 17 '22 01:05 mossadhelali

Your suggestion is welcome here. I don't know how common it is to use "?" for missing data, or how we are supposed to infer that without an explicit hint from users. For example, we can't simply replace every "?" with "" because it could be a legitimate value. What would you recommend to address this kind of ambiguity?

May 17 '22 01:05 sonichi

I have seen multiple OpenML datasets where NaN values are stored as "?". I understand that blindly replacing "?" with NaN might not be desirable in some situations, but how about analyzing the data types of the column values? If, e.g., 90%+ of the values in a column are integers, then "?" can be interpreted as NaN. If FLAML already has column data type checks, this could be integrated into them.
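
A rough sketch of that heuristic, to make the idea concrete. The function name placeholder_to_nan, its parameters, and the 0.9 default are hypothetical and not part of FLAML; it accepts any numeric value rather than only integers, and it leaves a column untouched if it contains anything other than numbers and the placeholder:

```python
import pandas as pd


def placeholder_to_nan(col, placeholder="?", min_numeric_frac=0.9):
    """Hypothetical helper: treat `placeholder` as NaN in a mostly numeric column."""
    if col.empty or pd.api.types.is_numeric_dtype(col):
        return col
    is_placeholder = col.astype(str).str.strip() == placeholder
    # Blank out the placeholder, then try to parse everything else as numbers.
    numeric = pd.to_numeric(col.where(~is_placeholder), errors="coerce")
    is_numeric = numeric.notna()
    # Convert only if every value is a number or the placeholder,
    # and numbers make up at least min_numeric_frac of the column.
    if (is_numeric | is_placeholder).all() and is_numeric.mean() >= min_numeric_frac:
        return numeric
    return col


mostly_numbers = pd.Series(["1"] * 19 + ["?"])   # 95% numeric
half_and_half = pd.Series(["1"] * 5 + ["?"] * 5)  # 50% numeric

converted = placeholder_to_nan(mostly_numbers)
unchanged = placeholder_to_nan(half_and_half)
print(converted.dtype, int(converted.isna().sum()))
print(unchanged.dtype)
```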

May 17 '22 16:05 mossadhelali

Interesting idea. Do you suggest replacing "?" with NaN automatically if 90%+ of values are integers and the remainder are "?"? What if I do use 0 and ? to represent two categories and I have 90% 0s in my data?

May 17 '22 20:05 sonichi