FLAML [FLAML Crash] [Regression] Do not support special JSON characters in feature name.

Hey, thanks for the great system.

I am experiencing a crash with two specific regression datasets: mv (used in the FLAML paper) and MagicTelescope. I get the following error when I fit FLAML on these datasets:

  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1524, in fit
    self._search()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 2009, in _search
    self._search_sequential()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 1825, in _search_sequential
    use_ray=False,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/tune/tune.py", line 382, in run
    result = training_function(trial_to_run.config)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/automl.py", line 240, in _compute_with_config_base
    self.fit_kwargs,
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 328, in compute_estimator
    fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 267, in evaluate_model_CV
    log_training_metric=log_training_metric, fit_kwargs=fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/ml.py", line 196, in get_test_loss
    estimator.fit(X_train, y_train, budget, **fit_kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 318, in fit
    self._t1 = self._fit(X_train, y_train, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/flaml/model.py", line 99, in _fit
    model.fit(X_train, y_train, **kwargs)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/sklearn.py", line 899, in fit
    categorical_feature=categorical_feature, callbacks=callbacks, init_model=init_model)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/engine.py", line 271, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 2605, in __init__
    train_set.construct()
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 1573, in _lazy_init
    return self.set_feature_name(feature_name)
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 2145, in set_feature_name
    ctypes.c_int(len(feature_name))))
  File "/home/mossad/anaconda3/envs/kgpip/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Do not support special JSON characters in feature name.

Here is my script:

import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

df = pd.read_csv('mv.csv')
X, y = df.drop('target', axis=1), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='regression',
                           time_budget=300,
                           retrain_full='budget',
                           verbose=0, metric='r2')

Your feedback is appreciated.

May 13 '22 08:05 mossadhelali

Could you share the .csv file? In the paper we used the OpenML API to load the data rather than a .csv file, so to reproduce the error I need the same .csv file you used. Thanks.

May 13 '22 14:05 sonichi

Thanks @sonichi for your reply. Please find the .csv files of (mv) and (MagicTelescope).

May 15 '22 03:05 mossadhelali

Thanks. For mv, the problem is that the first column is a numeric id column. The following code works for me:

import pandas as pd
from flaml import AutoML
from sklearn.model_selection import train_test_split

df = pd.read_csv('mv.csv', index_col=0)
X, y = df.drop('target', axis=1), df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
automl_model = AutoML()
automl_model.fit(X_train, y_train, task='regression',
                           time_budget=300,
                           retrain_full='budget',
                           verbose=3, metric='r2')
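
To see what index_col=0 changes, here is a minimal sketch on a made-up CSV (csv_text is a hypothetical stand-in, not the real mv.csv). It assumes the file was saved with its row index, so the header row starts with an empty field:

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV saved with its row index included:
# the header row starts with an empty field. This is NOT the real mv.csv.
csv_text = ",x1,x2,target\n0,1.0,a,3.5\n1,2.0,b,4.5\n"

# Default read: pandas invents a name for the header-less first column.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index, not a feature.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))  # ['Unnamed: 0', 'x1', 'x2', 'target']
print(list(indexed_df.columns))  # ['x1', 'x2', 'target']
```

With the default read, the ID values end up in X as a feature named 'Unnamed: 0'; with index_col=0 they do not.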

May 15 '22 18:05 sonichi

Thanks for your reply, @sonichi. I find it odd, because FLAML worked for me on the housing prices dataset (CSV link), whose first column is also an ID. Also, would this issue occur if the ID column were not the first one?

May 17 '22 01:05 mossadhelali

It "worked" because the first column there is named, but the ID column is still used as a feature column, which it shouldn't be. So for the housing prices dataset, using index_col=0 should give you better performance. The LightGBMError is raised because the first column in mv is unnamed. So mv has two issues, and index_col=0 solves both.
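
As a rough illustration of why an unnamed column triggers the error: pandas names a header-less column 'Unnamed: 0', and the colon is presumably what LightGBM objects to. The sketch below is an assumption-laden workaround, not part of FLAML or LightGBM: the character set in JSON_SPECIAL is inferred from the error message, and has_json_special and sanitize_feature_names are hypothetical helpers.

```python
import re

# Assumption: the characters LightGBM rejects in feature names are the JSON
# structural ones. Verify against the LightGBM version you are using.
JSON_SPECIAL = re.compile(r'[\[\]{}",:]')


def has_json_special(name):
    """Return True if the column name contains a JSON special character."""
    return bool(JSON_SPECIAL.search(str(name)))


def sanitize_feature_names(names):
    """Replace special characters with '_' (best-effort de-duplication)."""
    seen = {}
    cleaned = []
    for name in names:
        base = JSON_SPECIAL.sub("_", str(name))
        count = seen.get(base, 0)
        seen[base] = count + 1
        # Append a numeric suffix if the cleaned name was already produced.
        cleaned.append(base if count == 0 else f"{base}_{count}")
    return cleaned


columns = ["Unnamed: 0", "x1", "x2"]
print([c for c in columns if has_json_special(c)])  # ['Unnamed: 0']
print(sanitize_feature_names(columns))              # ['Unnamed_ 0', 'x1', 'x2']
```

Renaming (for example, X.columns = sanitize_feature_names(X.columns)) would only avoid the crash; the ID column would still be used as a feature, so index_col=0 remains the better fix for these files.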

May 17 '22 01:05 sonichi

I see, thanks, @sonichi. Can this be automated somehow, i.e. by detecting which column in a dataset is an index column?

May 17 '22 17:05 mossadhelali

It's an open question. The hard part is being correct in all cases.
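
To illustrate why, here is one naive heuristic, sketched under the assumption that an index column holds unique, consecutive integers. guess_id_columns is a hypothetical helper, not something FLAML provides, and the example data is made up:

```python
import pandas as pd


def guess_id_columns(df):
    """Return columns that look like row identifiers (heuristic only)."""
    n = len(df)
    candidates = []
    for col in df.columns:
        s = df[col]
        if n < 2 or not pd.api.types.is_integer_dtype(s):
            continue
        # Unique integers spanning exactly n consecutive values.
        if s.is_unique and s.max() - s.min() == n - 1:
            candidates.append(col)
    return candidates


df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],  # a real row index
    "age": [31, 25, 31, 40],     # ordinary feature, has duplicates
    "rank": [2, 1, 4, 3],        # genuine feature that also looks like an ID
})
print(guess_id_columns(df))  # ['Unnamed: 0', 'rank']
```

The false positive on rank shows the difficulty: a rule simple enough to catch real index columns can also discard legitimate features, so any such check is probably better used to warn than to drop columns silently.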

May 17 '22 20:05 sonichi