miceforest
Categorical columns with nan
My data frame has only one column with (categorical) missing values:
```
1         NaN
4         NaN
5           8
7         NaN
8         NaN
         ...
198895    NaN
198896      8
198897    NaN
198898    NaN
198899    NaN
Name: Proposed_Use, Length: 83428, dtype: category
Categories (84, int64): [0, 1, 2, 3, ..., 80, 81, 82, 83]
```
When I do:
```python
import miceforest as mf

# Assuming df_new is your DataFrame with missing values
kernel = mf.ImputationKernel(
    df_new.reset_index(),
    num_datasets=3,  # Number of imputed datasets to create
    save_all_iterations_data=True,
)

# Run the MICE algorithm for a specified number of iterations
kernel.mice(2)  # Number of iterations
```
I get this error:

```
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[17], line 9
2 kernel = mf.ImputationKernel(
3 df_new.reset_index() ,
4 num_datasets = 3, # Number of imputed datasets to create
5 save_all_iterations_data = True
6 )
8 # Run the MICE algorithm for a specified number of iterations
----> 9 kernel.mice(2) # Number of iterations
File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:1186, in ImputationKernel.mice(self, iterations, verbose, variable_parameters, **kwlgb)
1182 logger.set_start_time(time_key)
1183 bachelor_features = self._get_bachelor_features(
1184 variable=variable
1185 )
-> 1186 imputation_values = self._mean_match_mice(
1187 variable=variable,
1188 lgbmodel=current_model,
1189 bachelor_features=bachelor_features,
1190 candidate_features=candidate_features,
1191 candidate_values=candidate_values,
1192 dataset=dataset,
1193 iteration=iteration,
1194 )
1195 imputation_values.index = self.na_where[variable]
1196 logger.record_time(time_key)
File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:971, in ImputationKernel._mean_match_mice(self, variable, lgbmodel, bachelor_features, candidate_features, candidate_values, dataset, iteration)
961 candidate_preds = self._get_candidate_preds_mice(
962 variable=variable,
963 lgbmodel=lgbmodel,
(...)
966 iteration=iteration,
967 )
969 # By now, a numeric variable will be post-link, and
970 # categorical / binary variables will be pre-link.
--> 971 imputation_values = self._mean_match_nearest_neighbors(
972 mean_match_candidates=mean_match_candidates,
973 bachelor_preds=bachelor_preds,
974 candidate_preds=candidate_preds,
975 candidate_values=candidate_values,
976 random_state=self._random_state,
977 hashed_seeds=None,
978 )
980 else:
982 imputation_values = self._mean_match_fast(
983 variable=variable,
984 mean_match_candidates=mean_match_candidates,
(...)
987 hashed_seeds=None,
988 )
File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:602, in ImputationKernel._mean_match_nearest_neighbors(mean_match_candidates, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
598 num_bachelors = bachelor_preds.shape[0]
600 # balanced_tree = False fixes a recursion issue for some reason.
601 # https://github.com/scipy/scipy/issues/14799
--> 602 kd_tree = KDTree(candidate_preds, leafsize=16, balanced_tree=False)
603 _, knn_indices = kd_tree.query(
604 bachelor_preds, k=mean_match_candidates, workers=-1
605 )
607 # We can skip the random selection process if mean_match_candidates == 1
File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/scipy/spatial/_kdtree.py:360, in KDTree.__init__(self, data, leafsize, compact_nodes, copy_data, balanced_tree, boxsize)
357 raise TypeError("KDTree does not work with complex data")
359 # Note KDTree has different default leafsize from cKDTree
--> 360 super().__init__(data, leafsize, compact_nodes, copy_data,
361 balanced_tree, boxsize)
File _ckdtree.pyx:564, in scipy.spatial._ckdtree.cKDTree.__init__()
ValueError: data must be finite, check for nan or inf values
```
miceforest==6.0.3
scipy==1.15.2
I encountered the same problem. The direct cause is the `ValueError: data must be finite, check for nan or inf values` raised while constructing `scipy.spatial.KDTree`, which requires its input to be finite and rejects NaN or infinite (inf) values. Although miceforest exists to impute missing values (NaN), mean matching uses a KDTree to find nearest neighbors, and the predictions fed to that tree must be finite.
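For reference, the failure can be reproduced outside miceforest with the same KDTree call as in the traceback (the array here is made up; the inf stands in for a prediction derived from a degenerate 0.0/1.0 probability):

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical candidate predictions containing a non-finite entry
candidate_preds = np.array([[0.2], [0.7], [np.inf]])

# Same call as in _mean_match_nearest_neighbors; raises
# ValueError: data must be finite, check for nan or inf values
kd_tree = KDTree(candidate_preds, leafsize=16, balanced_tree=False)
```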
Problem Analysis:
During mean matching, miceforest's `_mean_match_nearest_neighbors` method builds a KDTree over `candidate_preds`. If `candidate_preds` contains NaN or inf values at that point, the KDTree construction fails. One workaround is to adjust the mean matching strategy: setting `mean_match_candidates=0` skips mean matching entirely and imputes directly with the model's predicted values:
```python
kernel = mf.ImputationKernel(
    data=X,
    variable_schema=X.columns.to_list(),
    random_state=42,
    mean_match_candidates=0,  # Skip mean matching
)
```
Later I found that the underlying cause may be an extremely unbalanced class distribution in the column, which makes lightgbm output a probability of exactly 1.0 or 0.0 ("This is usually because of rare classes. Try adjusting min_data_in_leaf."). An inf is then produced when that probability ends up as a denominator. So it makes sense to set a small `min_data_in_leaf` for this feature alone; 10 seems to work empirically. See the sketch below.
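The traceback above shows that `mice()` accepts a `variable_parameters` argument, so one way to try this is a per-variable override. A minimal sketch, assuming `variable_parameters` maps a column name to a dict of lightgbm parameters for that variable only (the column name `Proposed_Use` is taken from the original post):

```python
# Sketch: per-variable lightgbm override via mice()'s variable_parameters.
# Assumes variable_parameters maps column name -> lightgbm parameter dict.
kernel.mice(
    2,
    variable_parameters={"Proposed_Use": {"min_data_in_leaf": 10}},
)
```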
This problem was introduced in miceforest>=6.0.0; you can avoid it by using an older version (e.g. miceforest==5.7.0). Note, however, that there are breaking changes between v5 and v6.
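Pinning the older version, if you go that route:

```
pip install "miceforest==5.7.0"
```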
Another option is to 'catch' these categorical or Int64 columns and use the shap mean match scheme, something like:
```python
import miceforest

# Columns that need special handling: integer and categorical dtypes
caught_columns = df.select_dtypes(include=["int", "int64", "category"]).columns.to_list()

# Use the 'shap' mean match scheme for caught columns, 'normal' for the rest
mm_strategy = {}
for col in df.columns:
    if col in caught_columns:
        mm_strategy[col] = "shap"
    else:
        mm_strategy[col] = "normal"

kernel = miceforest.ImputationKernel(data=df, mean_match_strategy=mm_strategy)
```
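(As far as I can tell, the 'shap' scheme does its nearest-neighbor matching on SHAP values rather than on predicted probabilities, so a degenerate 0.0/1.0 probability never reaches the KDTree.)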
If you have a reproducible example where an unhelpful error is thrown, I can try to improve the error handling for this - but for now it's a tough problem to solve. You can't guarantee that lightgbm won't output a 1.0 or 0.0 probability before a model is trained, so this can't be checked at instantiation. I'll close this issue unless there's more discussion to be had here.