
Categorical columns with nan

Open · akshay6893 opened this issue 8 months ago · 1 comment

My data frame has only one column with (categorical) missing values:

```
1         NaN
4         NaN
5           8
7         NaN
8         NaN
         ...
198895    NaN
198896      8
198897    NaN
198898    NaN
198899    NaN
Name: Proposed_Use, Length: 83428, dtype: category
Categories (84, int64): [0, 1, 2, 3, ..., 80, 81, 82, 83]
```

When I do:

```python
# Assuming df_new is your DataFrame with missing values
kernel = mf.ImputationKernel(
    df_new.reset_index(),
    num_datasets=3,  # Number of imputed datasets to create
    save_all_iterations_data=True,
)

# Run the MICE algorithm for a specified number of iterations
kernel.mice(2)  # Number of iterations
```

I get this error:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 9
      2 kernel = mf.ImputationKernel(
      3     df_new.reset_index() ,
      4     num_datasets = 3,  # Number of imputed datasets to create
      5     save_all_iterations_data = True
      6 )
      8 # Run the MICE algorithm for a specified number of iterations
----> 9 kernel.mice(2)  # Number of iterations

File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:1186, in ImputationKernel.mice(self, iterations, verbose, variable_parameters, **kwlgb)
   1182 logger.set_start_time(time_key)
   1183 bachelor_features = self._get_bachelor_features(
   1184     variable=variable
   1185 )
-> 1186 imputation_values = self._mean_match_mice(
   1187     variable=variable,
   1188     lgbmodel=current_model,
   1189     bachelor_features=bachelor_features,
   1190     candidate_features=candidate_features,
   1191     candidate_values=candidate_values,
   1192     dataset=dataset,
   1193     iteration=iteration,
   1194 )
   1195 imputation_values.index = self.na_where[variable]
   1196 logger.record_time(time_key)

File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:971, in ImputationKernel._mean_match_mice(self, variable, lgbmodel, bachelor_features, candidate_features, candidate_values, dataset, iteration)
    961     candidate_preds = self._get_candidate_preds_mice(
    962         variable=variable,
    963         lgbmodel=lgbmodel,
   (...)
    966         iteration=iteration,
    967     )
    969     # By now, a numeric variable will be post-link, and
    970     # categorical / binary variables will be pre-link.
--> 971     imputation_values = self._mean_match_nearest_neighbors(
    972         mean_match_candidates=mean_match_candidates,
    973         bachelor_preds=bachelor_preds,
    974         candidate_preds=candidate_preds,
    975         candidate_values=candidate_values,
    976         random_state=self._random_state,
    977         hashed_seeds=None,
    978     )
    980 else:
    982     imputation_values = self._mean_match_fast(
    983         variable=variable,
    984         mean_match_candidates=mean_match_candidates,
   (...)
    987         hashed_seeds=None,
    988     )

File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/miceforest/imputation_kernel.py:602, in ImputationKernel._mean_match_nearest_neighbors(mean_match_candidates, bachelor_preds, candidate_preds, candidate_values, random_state, hashed_seeds)
    598 num_bachelors = bachelor_preds.shape[0]
    600 # balanced_tree = False fixes a recursion issue for some reason.
    601 # https://github.com/scipy/scipy/issues/14799
--> 602 kd_tree = KDTree(candidate_preds, leafsize=16, balanced_tree=False)
    603 _, knn_indices = kd_tree.query(
    604     bachelor_preds, k=mean_match_candidates, workers=-1
    605 )
    607 # We can skip the random selection process if mean_match_candidates == 1

File ~/miniconda3/envs/DataHidesBeauty/lib/python3.10/site-packages/scipy/spatial/_kdtree.py:360, in KDTree.__init__(self, data, leafsize, compact_nodes, copy_data, balanced_tree, boxsize)
    357     raise TypeError("KDTree does not work with complex data")
    359 # Note KDTree has different default leafsize from cKDTree
--> 360 super().__init__(data, leafsize, compact_nodes, copy_data,
    361                  balanced_tree, boxsize)

File _ckdtree.pyx:564, in scipy.spatial._ckdtree.cKDTree.__init__()

ValueError: data must be finite, check for nan or inf values
```

miceforest==6.0.3
scipy==1.15.2

akshay6893 · Mar 22 '25

I encountered the same problem. The cause is the ValueError: data must be finite, check for nan or inf values thrown while constructing scipy.spatial.KDTree, which requires finite numeric input and rejects NaN and infinity (inf). Although the miceforest library exists to impute missing values (NaN), its mean matching step uses KDTree to find nearest neighbors, and KDTree cannot accept non-finite values.

Problem Analysis:

During the mean matching process in miceforest, the _mean_match_nearest_neighbors method uses KDTree to find nearest neighbors. At this point, candidate_preds may contain NaN values, causing the KDTree construction to fail. One solution is to adjust the mean matching strategy. For example, setting mean_match_candidates to 0 will skip the mean matching process and directly use the predicted values for imputation:

```python
kernel = mf.ImputationKernel(
    data=X,
    variable_schema=X.columns.to_list(),
    random_state=42,
    mean_match_candidates=0,  # Skip mean matching
)
```

Later I found that the real cause may be an extremely imbalanced class distribution in the column, which makes lightgbm output a probability of exactly 1.0 or 0.0 ("This is usually because of rare classes. Try adjusting min_data_in_leaf."). An inf is then generated when the probability returned by LightGBM is used as a denominator. So it makes sense to set a small min_data_in_leaf for this feature alone; 10 seems to work empirically.
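The divergence is easy to demonstrate: a logit-style link blows up at a probability of exactly 0.0 or 1.0 (an illustration of the mechanism, not necessarily miceforest's exact internal transform). A per-column fix can then be sketched via mice()'s variable_parameters argument, whose name appears in the traceback signature above; the column name Proposed_Use comes from the original post:

```python
import numpy as np

# A probability of exactly 1.0 or 0.0 makes a logit-style link non-finite:
p = np.array([0.3, 1.0, 0.0])
with np.errstate(divide="ignore"):
    logit = np.log(p / (1 - p))
# logit[1] is +inf and logit[2] is -inf -- values KDTree later rejects.

# Hypothetical sketch: pass a small min_data_in_leaf for just the problem
# column through mice()'s `variable_parameters` argument.
variable_parameters = {"Proposed_Use": {"min_data_in_leaf": 10}}
# kernel.mice(2, variable_parameters=variable_parameters)
```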

Augustlnx · Mar 27 '25

This problem was introduced in miceforest>=6.0.0; you can avoid it by using an older version (e.g., miceforest==5.7.0). However, there are breaking changes between v5 and v6.
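If you go the pinning route, a requirements fragment might look like:

```
# requirements.txt -- pin the last v5 release to sidestep this issue
miceforest==5.7.0
```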

Another option is to 'catch' these categorical or Int64 columns and use the shap mean match scheme, something like:

```python
caught_columns = df.select_dtypes(include=["int", "int64", "category"]).columns.to_list()

# Use shap-based mean matching for the caught columns, normal otherwise
mm_strategy = {}
for col in df.columns:
    if col in caught_columns:
        mm_strategy[col] = 'shap'
    else:
        mm_strategy[col] = 'normal'

kernel = miceforest.ImputationKernel(data=df, mean_match_strategy=mm_strategy)
```
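To sanity-check which columns that dtype filter catches, here is a small pandas example (the toy DataFrame is made up):

```python
import pandas as pd

# Toy frame: one integer, one float, one categorical column
df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3], dtype="int64"),            # caught (integer)
        "b": pd.Series([0.1, None, 0.3], dtype="float64"),   # not caught
        "c": pd.Series(["x", "y", None], dtype="category"),  # caught
    }
)

caught_columns = df.select_dtypes(include=["int", "int64", "category"]).columns.to_list()
print(caught_columns)  # ['a', 'c']
```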

benedictjones · Sep 01 '25

If you have a reproducible example where an unhelpful error is thrown, I can try to improve the error handling for this - but for now it's a tough problem to solve. You can't guarantee that lightgbm won't output a 1.0 or 0.0 probability before a model is trained, so this can't be checked at instantiation. I'll close this issue unless there's more discussion to be had here.

AnotherSamWilson · Oct 22 '25