[FLAML Crash] [Classification] ValueError: Categorical categories must be unique
Hey, thanks for the great system.
I am experiencing a crash with a specific dataset. I get the following error when I fit FLAML on the Higgs dataset, a binary classification dataset used in the FLAML paper:
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
self._search()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
self._search_sequential()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
use_ray=False,
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
result = training_function(trial_to_run.config)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
self.fit_kwargs,
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 323, in compute_estimator
log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
estimator.fit(X_train, y_train, budget, **fit_kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 731, in fit
X_train = self._preprocess(X_train)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 689, in _preprocess
lambda x:
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/frame.py", line 8740, in apply
return op.apply()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 688, in apply
return self.apply_standard()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 812, in apply_standard
results, res_index = self.apply_series_generator()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
results[i] = self.f(v)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 692, in <lambda>
for c in x.cat.categories]))
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/accessor.py", line 93, in f
return self._delegate_method(name, *args, **kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2631, in _delegate_method
res = method(*args, **kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 1053, in rename_categories
cat.categories = new_categories
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 733, in categories
new_dtype = CategoricalDtype(categories, ordered=self.ordered)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 183, in __init__
self._finalize(categories, ordered, fastpath=False)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 337, in _finalize
categories = self.validate_categories(categories, fastpath=fastpath)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 540, in validate_categories
raise ValueError("Categorical categories must be unique")
ValueError: Categorical categories must be unique
Here is my script:
import pandas as pd
from sklearn.model_selection import train_test_split
from flaml import AutoML

df = pd.read_csv('higgs.csv')
X, y = df.drop('class', axis=1), df['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='classification',
                 time_budget=300,
                 retrain_full='budget',
                 verbose=0, metric='macro_f1')
Your feedback is appreciated.
Could you share the .csv file? A few lines are enough as long as this error can be reproduced. Also, could you let me know the flaml version?
Thanks for your reply, @sonichi. Please find the .csv file (higgs). I am using FLAML v0.6.3.
Thanks. I received a warning when reading the csv:
sys:1: DtypeWarning: Columns (20,21,22,23,24,25,26,27,28) have mixed types.Specify dtype option on import or set low_memory=False.
Then I found that the last row contains "?" in those columns. After I removed the "?" values, the warning went away and I no longer get the error.
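For anyone who would rather not edit the file by hand, here is a minimal sketch of the same workaround done at load time. The `na_values` parameter of `pd.read_csv` is standard pandas; the inline CSV is made-up stand-in data, not the real higgs file. The second half shows one plausible way a mixed-type column could lead to the reported error: it mimics the string conversion visible in the traceback's lambda. That link is my assumption and has not been confirmed against the FLAML source.

```python
import io
import pandas as pd

# Hypothetical stand-in for higgs.csv: one missing value written as "?".
csv_text = "feature,class\n0.5,1\n1.5,0\n?,1\n"

# Default parsing: the "?" makes pandas read the column as text.
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring "?" as a missing-value marker keeps the column numeric.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean["feature"].dtype)         # a float dtype
print(clean["feature"].isna().sum())  # one missing value

# Assumed failure mode: a categorical column holding both the float 0.5
# and the string "0.5". Converting every category to str produces
# duplicate labels, which pandas rejects.
mixed = pd.Series([0.5, "0.5", "?"], dtype="category")
message = None
try:
    mixed.cat.rename_categories([str(c) for c in mixed.cat.categories])
except ValueError as err:
    message = str(err)
print(message)
```

With the real file, the equivalent call would be `pd.read_csv('higgs.csv', na_values='?')`.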
Thanks for your reply, @sonichi. This worked for the higgs dataset, but I can imagine the same problem appearing with other datasets. Are there any plans to fix this in a future release?
Your suggestion is welcome here. I don't know how common it is to use "?" for missing data, or how we are supposed to infer that without an explicit hint from users. For example, we can't simply replace every "?" with "" because it could be a legitimate value. What would you recommend to address this kind of ambiguity?
I have seen multiple OpenML datasets where NaN values are stored as "?". I understand that blindly replacing "?" with NaN might not be desirable in some situations, but how about analyzing the data types of the values in each column? If, e.g., 90%+ of the values are integers, then "?" can be interpreted as NaN. If FLAML already has column data type checks, this could be integrated into them.
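To make the proposal concrete, here is a rough sketch of such a check. The helper name `question_mark_to_nan`, the `marker` argument, and the `min_numeric` threshold are all hypothetical; nothing here is existing FLAML functionality. It converts a column only when at least one marker is present, every other non-missing value parses as a number, and the numeric share meets the threshold.

```python
import pandas as pd

def question_mark_to_nan(column, marker="?", min_numeric=0.9):
    """Hypothetical helper: read `marker` as NaN in otherwise-numeric columns."""
    if pd.api.types.is_numeric_dtype(column):
        return column  # already numeric, nothing to infer
    text = column.astype(str).str.strip()
    is_marker = text.eq(marker)
    # Values that are neither the marker nor already missing.
    candidates = ~is_marker & ~column.isna()
    # Non-candidates become NaN; candidates that fail to parse also become NaN.
    parsed = pd.to_numeric(text.where(candidates), errors="coerce")
    numeric_share = parsed.notna().sum() / max(len(column), 1)
    others_all_numeric = bool(parsed[candidates].notna().all())
    if is_marker.any() and others_all_numeric and numeric_share >= min_numeric:
        return parsed
    return column  # ambiguous or non-numeric: leave untouched

mostly_numbers = pd.Series([str(i) for i in range(19)] + ["?"])
labels = pd.Series(["a", "b", "?"])

converted = question_mark_to_nan(mostly_numbers)
untouched = question_mark_to_nan(labels)
print(converted.isna().sum())
print(untouched.tolist())
```

The column of labels is returned unchanged because "a" and "b" do not parse as numbers, so the "?" there is left alone.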
Interesting idea. Do you suggest replacing "?" with NaN automatically if 90%+ of the values are integers and the remainder are "?"? What if I use "0" and "?" to represent two categories and 90% of my data are 0s?
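One way around this ambiguity is to keep the decision with the user, at load time. The sketch below uses the dict form of `na_values` in `pd.read_csv` (standard pandas) to mark "?" as missing in one column only, while a second column keeps "0" and "?" as ordinary labels. The column names and data are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data: "?" means missing in `reading`,
# but is a real category label in `flag`.
csv_text = "reading,flag\n0.7,0\n?,0\n1.2,?\n0.4,0\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"reading": ["?"]},  # "?" is missing only in this column
    dtype={"flag": str},           # keep "0" and "?" as text labels
)

print(df["reading"].isna().sum())
print(df["flag"].tolist())
```

This does not answer whether a library should guess, but it shows that an explicit per-column hint is cheap for the user to give.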