[FLAML Crash] [Regression] Do not support special JSON characters in feature name.
Hey, thanks for the great system.
I am experiencing a crash with two specific regression datasets: mv (used in the FLAML paper) and MagicTelescope. I get the following error when I fit FLAML on these datasets:
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
self._search()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
self._search_sequential()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
use_ray=False,
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
result = training_function(trial_to_run.config)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
self.fit_kwargs,
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 328, in compute_estimator
fit_kwargs=fit_kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 267, in evaluate_model_CV
log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
estimator.fit(X_train, y_train, budget, **fit_kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 318, in fit
self._t1 = self._fit(X_train, y_train, **kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 99, in _fit
model.fit(X_train, y_train, **kwargs)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/sklearn.py", line 899, in fit
categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
callbacks=callbacks
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
booster = Booster(params=params, train_set=train_set)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 2605, in __init__
train_set.construct()
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 1819, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 1573, in _lazy_init
return self.set_feature_name(feature_name)
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 2145, in set_feature_name
ctypes.c_int(len(feature_name))))
File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Do not support special JSON characters in feature name.
Here is my script:
import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split
df = pd.read_csv('mv.csv')
X, y = df.drop('target', axis=1), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='regression',
                 time_budget=300,
                 retrain_full='budget',
                 verbose=0, metric='r2')
Your feedback is appreciated.
Could you share the .csv file? In the paper we loaded the data through the OpenML API rather than from a .csv file, so to reproduce the error I need the same .csv file you used. Thanks.
Thanks @sonichi for your reply. Please find the .csv files of (mv) and (MagicTelescope).
Thanks. For mv, the problem is that the first column is a numeric id column. The following code works for me:
import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split
df = pd.read_csv('mv.csv', index_col=0)
X, y = df.drop('target', axis=1), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='regression',
time_budget=300,
retrain_full='budget',
verbose=3, metric='r2')
Thanks for your reply, @sonichi. I find it weird because FLAML worked for me on the housing prices dataset (CSV link), which also has an ID as its first column. Also, would this issue happen if the ID column were not the first one?
It "worked" because the first column is named, but the ID column is still used as a feature column, which it should not be. So for the housing prices dataset, using index_col=0 should give you better performance.
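The LightGBMError is raised because the first column is unnamed in mv. pandas labels a column with an empty header as "Unnamed: 0", and the colon in that generated name is most likely the special JSON character LightGBM rejects. So mv has two issues, and index_col=0 solves both.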
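To make the explanation above concrete, here is a minimal sketch that reproduces the column-name problem without FLAML or LightGBM. The tiny CSV text, the helper name columns_with_special_chars, and the exact set of rejected characters are assumptions for illustration; they are not taken from mv.csv or from LightGBM's source.

```python
import io

import pandas as pd

# Hypothetical miniature CSV whose first header cell is empty, mimicking a
# file that was written with an unnamed index column.
CSV_TEXT = ",x1,x2,target\n0,1.0,a,3.5\n1,2.0,b,4.5\n"

# Assumed set of characters that LightGBM refuses in feature names.
SPECIAL_JSON_CHARS = set('",:[]{}')


def columns_with_special_chars(df):
    """Return the column names that contain any of the assumed special characters."""
    return [c for c in df.columns if SPECIAL_JSON_CHARS & set(str(c))]


# Default read: pandas invents a name for the empty header cell.
raw = pd.read_csv(io.StringIO(CSV_TEXT))
# With index_col=0 the first column becomes the index and is not a feature.
indexed = pd.read_csv(io.StringIO(CSV_TEXT), index_col=0)

bad_raw = columns_with_special_chars(raw)
bad_indexed = columns_with_special_chars(indexed)
print("default read:", bad_raw)
print("index_col=0: ", bad_indexed)
```

Running a check like this before calling fit shows which column names would need renaming or dropping. Whether a flagged column should be dropped or renamed still depends on whether it is really an ID.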
I see, thanks, @sonichi. Can this be automated somehow, i.e. by detecting which columns in a dataset are index columns?
It's an open question. The hard part is to be correct in all the cases.
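As an illustration of why this is hard, here is one possible heuristic, sketched under the assumption that an ID column is an integer column that increases by exactly 1 from row to row. The function name guess_id_columns and the demo data are hypothetical; nothing like this is claimed to exist in FLAML.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def guess_id_columns(df):
    """Return names of integer columns whose values step by exactly 1 per row.

    This is only a heuristic: it only suggests candidates and can be wrong
    in both directions.
    """
    candidates = []
    for name in df.columns:
        col = df[name]
        if len(col) < 2 or not is_integer_dtype(col):
            continue
        # diff() yields NaN for the first row, so drop it before comparing.
        if (col.diff().dropna() == 1).all():
            candidates.append(name)
    return candidates


# Hypothetical demo frame: "id" is a row counter, the others are features.
demo = pd.DataFrame(
    {
        "id": [10, 11, 12, 13],
        "rooms": [3, 2, 3, 4],
        "price": [1.5, 2.0, 2.5, 3.0],
    }
)
candidates = guess_id_columns(demo)
print(candidates)
```

A rule like this misses shuffled or string IDs and would wrongly flag a genuine feature that happens to be consecutive, such as a sorted year column, so it seems safer to surface the candidates to the user than to drop them automatically.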