miceforest
Pandas dtypes modified when predicting on single records
Hi,
We encountered several issues when we tried to impute a new dataset with one or a few records. We made sure to convert the input dataset to the same dtypes we had during training, but we still get:
ValueError: train and valid dataset categorical_feature do not match.
After some investigation, we found that the dtypes of the input data can change in the function _assign_col_values_without_copy from utils. We applied a quick and dirty fix on our side:
def _assign_col_values_without_copy(dat, row_ind, col_ind, val):
    """
    Insert values into different data frame objects.
    """
    row_ind = _ensure_iterable(row_ind)
    if isinstance(dat, pd_DataFrame):
        dtype_bef = dat.iloc[:, col_ind].dtype
        dat.iloc[row_ind, col_ind] = val
        dtype_aft = dat.iloc[:, col_ind].dtype
        # Keep the same dtype after the value assignment
        if dtype_bef != dtype_aft:
            dat.iloc[:, col_ind] = dat.iloc[:, col_ind].astype(dtype_bef)
    elif isinstance(dat, np.ndarray):
        dat[row_ind, col_ind] = val
    else:
        raise ValueError("Unknown data class passed.")
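To sanity-check the patch, here is a small standalone sketch (it supplies stand-ins for the miceforest-internal names pd_DataFrame and _ensure_iterable that the function expects):

import numpy as np
import pandas as pd
from pandas import DataFrame as pd_DataFrame

def _ensure_iterable(x):
    # Simplified stand-in for the miceforest helper.
    return x if hasattr(x, "__iter__") else [x]

# A categorical column, assigned to positionally as miceforest does.
df = pd.DataFrame({"x": pd.Categorical(["a", "b", "a", "b"])})
_assign_col_values_without_copy(
    df, row_ind=[0, 1], col_ind=0, val=np.array(["b", "a"], dtype=object)
)
print(df["x"].dtype)  # stays category (the patch restores it if the assignment upcast the column)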
It now works on individual predictions, but we wanted to get your thoughts on this potential fix.
Thanks for your help!
Imputing single records is definitely an area that is lacking in testing. Thanks for bringing this to my attention. You mentioned several issues; can you open issues on GitHub for them?
Sorry, I misspoke: all the errors we had were related to this specific _assign_col_values_without_copy function. For your information, we ran the following tests:
- Impute new data without applying the same dtypes for the categorical columns. It fails, but I guess this is the expected behavior.
- Impute new data after applying the dtypes schema (see the sketch below). It fails as well, because the dtypes get updated by this utils function (only when we have a few records).
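By "applying the dtypes schema" we mean casting the new records to the dtypes saved at training time, roughly like this (a sketch; training_data and new_data are placeholder names):

# Saved at training time: {column name: dtype}
dtypes_schema = training_data.dtypes.to_dict()

# Cast the new records to the training dtypes before imputing.
new_data = new_data.astype(dtypes_schema)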
Please let me know if I can further help!
I am actually testing this now, and the imputation works fine on this single row of a toy dataset:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import miceforest as mf
# Make random state and load data
random_state = np.random.RandomState(5)
boston = pd.DataFrame(load_boston(return_X_y=True)[0])
boston.columns = [str(i) for i in boston.columns]
boston["3"] = boston["3"].map({0: 'a', 1: 'b'}).astype('category')
boston["8"] = boston["8"].astype("category")
boston_amp = mf.ampute_data(boston, perc=0.25, random_state=random_state)
kernel = mf.ImputationKernel(
    data=boston_amp,
    datasets=2,
    save_models=1
)
kernel.mice(iterations=2, compile_candidates=True, verbose=True)
# Make sure single rows can be imputed
single_row = boston_amp.iloc[[0], :]
imp_sr = kernel.impute_new_data(single_row)
imp_sr.complete_data(0).dtypes == single_row.dtypes
I'm wondering if you are passing a Series instead of a single row in a DataFrame? If a pandas DataFrame is subset like so:
single_row = boston_amp.iloc[0, :]
it will return a Series and strip the dtypes, which is why I had to subset with [0] instead of 0 above.
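For example (plain pandas, independent of miceforest):

import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["x", "y"]), "b": [1.0, 2.0]})

print(type(df.iloc[0, :]))     # pandas Series
print(df.iloc[0, :].dtype)     # object: the mixed-dtype row collapses to one dtype

print(type(df.iloc[[0], :]))   # pandas DataFrame
print(df.iloc[[0], :].dtypes)  # a: category, b: float64 (dtypes preserved)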
I was passing a single row in a DataFrame, not a Series. I have just tried running your example and I get exactly the same error on my side:
ValueError: train and valid dataset categorical_feature do not match.
Can you share your pandas version? I guess the issue comes from the dependencies, then.
Hmmm, I'm at work so I can't give you my pip freeze right now. However, that's a lightgbm error that is being thrown. Can you tell me the lightgbm version you are using? It might need to be updated. I'm also curious why lightgbm would throw that error; miceforest doesn't make use of validation datasets in impute_new_data...
I am using lightgbm 3.3.2, which satisfies the miceforest dependency. I think the 'valid' dataset in this case means the new data, so the error is thrown when the training categorical features do not match the new data's categorical features. Here is the error with all the details:
File ../miceforest/ImputationKernel.py:1396, in ImputationKernel.impute_new_data(self, new_data, datasets, iterations, save_all_iterations, copy_data, random_state, random_seed_array, verbose)
   1390 current_model = self.get_model(
   1391     variable=var, dataset=ds, iteration=model_iteration
   1392 )
   1394 seeds = random_seed_array[nawhere] if use_seed_array else None
   1395 imp_values = np.array(
-> 1396     self.mean_match_function(
   1397         mmc=self.mean_match_candidates[var],
   1398         model=current_model,
   1399         candidate_features=candidate_features,
   1400         bachelor_features=bachelor_features,
   1401         candidate_values=candidate_values,
   1402         random_state=random_state,
   1403         hashed_seeds=seeds,
   1404     )
   1405 )
   1406 imputed_data._insert_new_data(
   1407     dataset=ds, variable_index=var, new_data=imp_values
   1408 )
   1409 # Refresh our seeds.

File ../miceforest/mean_matching_functions.py:98, in default_mean_match(mmc, model, candidate_features, bachelor_features, candidate_values, random_state, hashed_seeds)
     88 assert objective in regressive_objectives + [
     89     "binary",
     90     "multiclass",
    (...)
     94     + "define a custom mean matching function to handle this objective."
     95 )
     97 # Need these no matter what.
---> 98 bachelor_preds = model.predict(bachelor_features)
     99 num_bachelors = bachelor_preds.shape[0]
    101 # mmc = 0 is deterministic

File ../lightgbm/basic.py:3538, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape, **kwargs)
   3536 else:
   3537     num_iteration = -1
-> 3538 return predictor.predict(data, start_iteration, num_iteration,
   3539                          raw_score, pred_leaf, pred_contrib,
   3540                          data_has_header, is_reshape)

File ../lightgbm/basic.py:820, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape)
    818 if isinstance(data, Dataset):
    819     raise TypeError("Cannot use Dataset instance for prediction, please use raw data instead")
--> 820 data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]
    821 predict_type = C_API_PREDICT_NORMAL
    822 if raw_score:

File ../lightgbm/basic.py:575, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
    573 else:
    574     if len(cat_cols) != len(pandas_categorical):
--> 575         raise ValueError('train and valid dataset categorical_feature do not match.')
    576     for col, category in zip(cat_cols, pandas_categorical):
    577         if list(data[col].cat.categories) != list(category):

ValueError: train and valid dataset categorical_feature do not match.
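For what it's worth, the same ValueError can be reproduced directly with lightgbm, independently of miceforest (a sketch against lightgbm 3.x behavior, with made-up column names), by predicting on a frame whose categorical column has lost its dtype:

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num": rng.random(100),
    "cat": pd.Categorical(rng.choice(["a", "b"], 100)),
})
y = rng.random(100)

# Train with a pandas categorical feature; lightgbm records it internally.
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, label=y))

# Predicting on data where "cat" is no longer a categorical dtype raises:
# ValueError: train and valid dataset categorical_feature do not match.
X_new = X.head(3).astype({"cat": object})
booster.predict(X_new)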
If this is part of a pipeline, can you build a pipeline that involves all the steps up to imputation and look at the resulting data that gets sent to impute_new_data? That should shed light on what's happening.
That error comes from the test sample script you provided above, so it is not part of a pipeline. And when it is part of a pipeline, imputation comes first, so there is no data transformation before impute_new_data is called.
Here is my pip freeze:
alabaster==0.7.12
atomicwrites==1.4.0
attrs==21.2.0
Babel==2.9.1
beautifulsoup4==4.10.0
black==22.6.0
bleach==4.1.0
blosc==1.10.6
certifi==2021.5.30
charset-normalizer==2.0.4
click==8.1.3
cloudpickle==2.0.0
colorama==0.4.4
cycler==0.10.0
dill==0.3.4
docutils==0.17.1
et-xmlfile==1.1.0
faiss-cpu==1.7.1.post2
html5lib==1.1
idna==3.2
imagesize==1.2.0
importlib-metadata==4.8.1
iniconfig==1.1.1
Jinja2==3.0.2
joblib==1.0.1
keyring==23.1.0
kiwisolver==1.3.2
lightgbm==3.3.1
llvmlite==0.37.0
lxml==4.6.3
MarkupSafe==2.0.1
matplotlib==3.4.3
miceforest==5.6.2
mypy==0.971
mypy-extensions==0.4.3
numba==0.54.1
numpy==1.22.4
openpyxl==3.0.10
packaging==21.0
pandas==1.3.3
pathspec==0.9.0
Pillow==8.3.2
pkginfo==1.7.1
platformdirs==2.5.2
plotly==5.5.0
pluggy==1.0.0
py==1.10.0
pyarrow==5.0.0
Pygments==2.10.0
pykdtree==1.3.4
pyparsing==2.4.7
pytest==6.2.5
python-dateutil==2.8.2
pytz==2021.1
pywin32-ctypes==0.2.0
readme-renderer==29.0
requests==2.26.0
requests-toolbelt==0.9.1
rfc3986==1.5.0
scikit-learn==0.24.2
scipy==1.7.3
seaborn==0.11.2
shap==0.40.0
six==1.16.0
sklearn==0.0
slicer==0.0.7
snowballstemmer==2.1.0
soupsieve==2.2.1
Sphinx==4.2.0
sphinx-rtd-theme==1.0.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
tenacity==8.0.1
threadpoolctl==2.2.0
toml==0.10.2
tomli==2.0.1
tqdm==4.62.2
twine==3.4.2
typing-extensions==4.3.0
urllib3==1.26.6
webencodings==0.5.1
xlsx2html==0.4.0
zipp==3.5.0
Did this solve your issue? If not, can you send me your requirements file?
Sorry, I did not have time to test your requirements yesterday. After some investigation, it looks like the problem is not related to the dependencies but rather to the test cases. If I run the Boston test case in the same environment, but predict the first 3 rows:
single_row = boston_amp.iloc[0:3, :]
you will get ValueError: train and valid dataset categorical_feature do not match.
The error comes from the assignment dat.iloc[row_ind, col_ind] = val in the _assign_col_values_without_copy utils function. It changes the dtype of the column to object instead of keeping it as a category column.
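The mutation can be reproduced directly on the toy data (a sketch; whether the upcast actually happens depends on the pandas version and on how many rows are assigned):

import numpy as np

work = boston_amp.copy()
col = work.columns.get_loc("3")  # one of the category columns
print(work.iloc[:, col].dtype)   # category

# The same positional assignment the utils function performs:
work.iloc[[0, 1, 2], col] = np.array(["a", "b", "a"], dtype=object)
print(work.iloc[:, col].dtype)   # may now be object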
Hope it helps!