miceforest
Pandas dtypes modified when predicting on single records
Hi,
We encountered several issues when we tried to impute a new dataset with one or a few records. We made sure to convert the input dataset to the same dtypes we had during training, but we still get:
ValueError: train and valid dataset categorical_feature do not match.
After some investigation, we found that the dtypes of the input data can change in the function _assign_col_values_without_copy from utils. We applied a quick and dirty fix on our side:
def _assign_col_values_without_copy(dat, row_ind, col_ind, val):
    """
    Insert values into different data frame objects.
    """
    row_ind = _ensure_iterable(row_ind)
    if isinstance(dat, pd_DataFrame):
        dtype_bef = dat.iloc[:, col_ind].dtype
        dat.iloc[row_ind, col_ind] = val
        dtype_aft = dat.iloc[:, col_ind].dtype
        # Keep the same dtype after the value assignment
        if dtype_bef != dtype_aft:
            dat.iloc[:, col_ind] = dat.iloc[:, col_ind].astype(dtype_bef)
    elif isinstance(dat, np.ndarray):
        dat[row_ind, col_ind] = val
    else:
        raise ValueError("Unknown data class passed.")
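To sanity-check the patch, here is a small standalone sketch (it supplies stand-ins for the miceforest-internal names pd_DataFrame and _ensure_iterable that the function expects):

import numpy as np
import pandas as pd
from pandas import DataFrame as pd_DataFrame

def _ensure_iterable(x):
    # Simplified stand-in for the miceforest helper.
    return x if hasattr(x, "__iter__") else [x]

# A categorical column, assigned to positionally as miceforest does.
df = pd.DataFrame({"x": pd.Categorical(["a", "b", "a", "b"])})
_assign_col_values_without_copy(
    df, row_ind=[0, 1], col_ind=0, val=np.array(["b", "a"], dtype=object)
)
print(df["x"].dtype)  # stays category (the patch restores it if the assignment upcast the column)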
It now works on individual predictions, but we wanted to get your thoughts on this potential fix.
Thanks for your help!
Imputing single records is definitely an area that is lacking in testing. Thanks for bringing this to my attention. You mentioned several issues; can you open issues on GitHub for them?
Sorry, I misspoke: all the errors we had were related to this specific _assign_col_values_without_copy function. For your information, we ran the following tests:
- Impute new data without applying the same dtypes for the categorical columns. It fails, but I guess this is the expected behavior.
- Impute new data after applying the dtypes schema (see the sketch below). It fails as well, because the dtypes get updated by this utils function (only when we have a few records).
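By "applying the dtypes schema" we mean casting the new records to the dtypes saved at training time, roughly like this (a sketch; training_data and new_data are placeholder names):

# Saved at training time: {column name: dtype}
dtypes_schema = training_data.dtypes.to_dict()

# Cast the new records to the training dtypes before imputing.
new_data = new_data.astype(dtypes_schema)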
Please let me know if I can further help!
I am actually testing this now, and the imputation works fine on this single row of a toy dataset:
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import miceforest as mf
# Make random state and load data
random_state = np.random.RandomState(5)
boston = pd.DataFrame(load_boston(return_X_y=True)[0])
boston.columns = [str(i) for i in boston.columns]
boston["3"] = boston["3"].map({0: 'a', 1: 'b'}).astype('category')
boston["8"] = boston["8"].astype("category")
boston_amp = mf.ampute_data(boston, perc=0.25, random_state=random_state)
kernel = mf.ImputationKernel(
    data=boston_amp,
    datasets=2,
    save_models=1
)
kernel.mice(iterations=2, compile_candidates=True, verbose=True)
# Make sure single rows can be imputed
single_row = boston_amp.iloc[[0], :]
imp_sr = kernel.impute_new_data(single_row)
imp_sr.complete_data(0).dtypes == single_row.dtypes
I'm wondering if you are passing a Series instead of a single row in a DataFrame? If a pandas DataFrame is subset like so:
single_row = boston_amp.iloc[0, :]
it will return a Series and strip the dtypes, which is why I had to subset with [0] instead of 0 above.
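For example (plain pandas, independent of miceforest):

import pandas as pd

df = pd.DataFrame({"a": pd.Categorical(["x", "y"]), "b": [1.0, 2.0]})

print(type(df.iloc[0, :]))     # pandas Series
print(df.iloc[0, :].dtype)     # object: the mixed-dtype row collapses to one dtype

print(type(df.iloc[[0], :]))   # pandas DataFrame
print(df.iloc[[0], :].dtypes)  # a: category, b: float64 (dtypes preserved)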
I was passing a single row in a DataFrame, not a Series. I have just tried running your example and I get exactly the same error on my side:
ValueError: train and valid dataset categorical_feature do not match.
Can you share your pandas version? I guess the issue comes from the dependencies, then.
Hmmm, I'm at work so I can't give you my pip freeze right now. However, that's a lightgbm error that is being thrown. Can you tell me the lightgbm version you are using? It might need to be updated. I'm also curious why lightgbm would throw that error; miceforest doesn't make use of validation datasets in impute_new_data...
I am using lightgbm 3.3.2, which satisfies the miceforest dependency. I think the 'valid' dataset in this case means the new data, so the error is thrown when the training categorical features do not match the new data's categorical features. Here is the error with all the details:
File ../miceforest/ImputationKernel.py:1396, in ImputationKernel.impute_new_data(self, new_data, datasets, iterations, save_all_iterations, copy_data, random_state, random_seed_array, verbose)
   1390 current_model = self.get_model(
   1391     variable=var, dataset=ds, iteration=model_iteration
   1392 )
   1394 seeds = random_seed_array[nawhere] if use_seed_array else None
   1395 imp_values = np.array(
-> 1396     self.mean_match_function(
   1397         mmc=self.mean_match_candidates[var],
   1398         model=current_model,
   1399         candidate_features=candidate_features,
   1400         bachelor_features=bachelor_features,
   1401         candidate_values=candidate_values,
   1402         random_state=random_state,
   1403         hashed_seeds=seeds,
   1404     )
   1405 )
   1406 imputed_data._insert_new_data(
   1407     dataset=ds, variable_index=var, new_data=imp_values
   1408 )
   1409 # Refresh our seeds.

File ../miceforest/mean_matching_functions.py:98, in default_mean_match(mmc, model, candidate_features, bachelor_features, candidate_values, random_state, hashed_seeds)
     88 assert objective in regressive_objectives + [
     89     "binary",
     90     "multiclass",
    (...)
     94     + "define a custom mean matching function to handle this objective."
     95 )
     97 # Need these no matter what.
---> 98 bachelor_preds = model.predict(bachelor_features)
     99 num_bachelors = bachelor_preds.shape[0]
    101 # mmc = 0 is deterministic

File ../lightgbm/basic.py:3538, in Booster.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape, **kwargs)
   3536 else:
   3537     num_iteration = -1
-> 3538 return predictor.predict(data, start_iteration, num_iteration,
   3539                          raw_score, pred_leaf, pred_contrib,
   3540                          data_has_header, is_reshape)

File ../lightgbm/basic.py:820, in _InnerPredictor.predict(self, data, start_iteration, num_iteration, raw_score, pred_leaf, pred_contrib, data_has_header, is_reshape)
    818 if isinstance(data, Dataset):
    819     raise TypeError("Cannot use Dataset instance for prediction, please use raw data instead")
--> 820 data = _data_from_pandas(data, None, None, self.pandas_categorical)[0]
    821 predict_type = C_API_PREDICT_NORMAL
    822 if raw_score:

File ../lightgbm/basic.py:575, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
    573 else:
    574     if len(cat_cols) != len(pandas_categorical):
--> 575         raise ValueError('train and valid dataset categorical_feature do not match.')
    576     for col, category in zip(cat_cols, pandas_categorical):
    577         if list(data[col].cat.categories) != list(category):

ValueError: train and valid dataset categorical_feature do not match.
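For what it's worth, the same ValueError can be reproduced directly with lightgbm, independently of miceforest (a sketch against lightgbm 3.x behavior, with made-up column names), by predicting on a frame whose categorical column has lost its dtype:

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num": rng.random(100),
    "cat": pd.Categorical(rng.choice(["a", "b"], 100)),
})
y = rng.random(100)

# Train with a pandas categorical feature; lightgbm records it internally.
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, label=y))

# Predicting on data where "cat" is no longer a categorical dtype raises:
# ValueError: train and valid dataset categorical_feature do not match.
X_new = X.head(3).astype({"cat": object})
booster.predict(X_new)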
If this is part of a pipeline, can you build a pipeline that involves all the steps up to imputation and look at the resulting data that gets sent to impute_new_data? That should shed light on what's happening.
That error comes from the test sample script you provided above, so it is not part of a pipeline. And when it is part of a pipeline, imputation comes first, so there is no data transformation before impute_new_data is called.
Here is my pip freeze:
alabaster==0.7.12
atomicwrites==1.4.0
attrs==21.2.0
Babel==2.9.1
beautifulsoup4==4.10.0
black==22.6.0
bleach==4.1.0
blosc==1.10.6
certifi==2021.5.30
charset-normalizer==2.0.4
click==8.1.3
cloudpickle==2.0.0
colorama==0.4.4
cycler==0.10.0
dill==0.3.4
docutils==0.17.1
et-xmlfile==1.1.0
faiss-cpu==1.7.1.post2
html5lib==1.1
idna==3.2
imagesize==1.2.0
importlib-metadata==4.8.1
iniconfig==1.1.1
Jinja2==3.0.2
joblib==1.0.1
keyring==23.1.0
kiwisolver==1.3.2
lightgbm==3.3.1
llvmlite==0.37.0
lxml==4.6.3
MarkupSafe==2.0.1
matplotlib==3.4.3
miceforest==5.6.2
mypy==0.971
mypy-extensions==0.4.3
numba==0.54.1
numpy==1.22.4
openpyxl==3.0.10
packaging==21.0
pandas==1.3.3
pathspec==0.9.0
Pillow==8.3.2
pkginfo==1.7.1
platformdirs==2.5.2
plotly==5.5.0
pluggy==1.0.0
py==1.10.0
pyarrow==5.0.0
Pygments==2.10.0
pykdtree==1.3.4
pyparsing==2.4.7
pytest==6.2.5
python-dateutil==2.8.2
pytz==2021.1
pywin32-ctypes==0.2.0
readme-renderer==29.0
requests==2.26.0
requests-toolbelt==0.9.1
rfc3986==1.5.0
scikit-learn==0.24.2
scipy==1.7.3
seaborn==0.11.2
shap==0.40.0
six==1.16.0
sklearn==0.0
slicer==0.0.7
snowballstemmer==2.1.0
soupsieve==2.2.1
Sphinx==4.2.0
sphinx-rtd-theme==1.0.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
tenacity==8.0.1
threadpoolctl==2.2.0
toml==0.10.2
tomli==2.0.1
tqdm==4.62.2
twine==3.4.2
typing-extensions==4.3.0
urllib3==1.26.6
webencodings==0.5.1
xlsx2html==0.4.0
zipp==3.5.0
Did this solve your issue? If not, can you send me your requirements file?
Sorry, I did not have time to test your requirements yesterday. After some investigation, it looks like the problem is not related to the dependencies but rather to the test cases. If I run the Boston test case in the same environment, but predict the first 3 rows:
single_row = boston_amp.iloc[0:3, :]
you will get ValueError: train and valid dataset categorical_feature do not match.
The error comes from the assignment dat.iloc[row_ind, col_ind] = val in the _assign_col_values_without_copy utils function. It changes the dtype of the column to object instead of keeping it as a category column.
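The mutation can be reproduced directly on the toy data (a sketch; whether the upcast actually happens depends on the pandas version and on how many rows are assigned):

import numpy as np

work = boston_amp.copy()
col = work.columns.get_loc("3")  # one of the category columns
print(work.iloc[:, col].dtype)   # category

# The same positional assignment the utils function performs:
work.iloc[[0, 1, 2], col] = np.array(["a", "b", "a"], dtype=object)
print(work.iloc[:, col].dtype)   # may now be object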
Hope it helps!