shapash
ValueError: The condensed distance matrix must contain only finite values.
Hi, I'm using generate_report with an LGBMClassifier for binary classification. My data has categorical features and missing values, which LightGBM can handle natively. I'm able to get the dashboard to run; however, when I try to generate a report with the following code:
xpl.generate_report(
    output_file='report.html',
    project_info_file='model.yml',
    x_train=X_train,
    y_train=y_train,
    y_test=y_test,
    title_story="CCA Default Risk",
    metrics=[
        {
            'path': 'sklearn.metrics.f1_score',
            'name': 'f1 score',
        },
        {
            'path': 'sklearn.metrics.balanced_accuracy',
            'name': 'Balanced Accuracy',
        },
        {
            'path': 'sklearn.metrics.roc_auc',
            'name': 'ROC AUC',
        }
    ]
)
I get the following error:
PapermillExecutionError:
---------------------------------------------------------------------------
Exception encountered at "In [8]":
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[8], line 1
----> 1 report.display_dataset_analysis()
File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\report\project_report.py:284, in ProjectReport.display_dataset_analysis(self, global_analysis, univariate_analysis, target_analysis, multivariate_analysis)
282 if multivariate_analysis:
283 print_md("### Multivariate analysis")
--> 284 fig_corr = self.explainer.plot.correlations(
285 self.df_train_test,
286 facet_col='data_train_test',
287 max_features=20,
288 width=900 if len(self.df_train_test['data_train_test'].unique()) > 1 else 500,
289 height=500,
290 )
291 print_html(plotly.io.to_html(fig_corr))
292 print_md('---')
File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2296, in SmartPlotter.correlations(self, df, max_features, features_to_hide, facet_col, how, width, height, degree, decimals, file_name, auto_open)
2294 if len(list_features) == 0:
2295 top_features = compute_top_correlations_features(corr=corr, max_features=max_features)
-> 2296 corr = cluster_corr(corr.loc[top_features, top_features], degree=degree)
2297 list_features = list(corr.columns)
2299 fig.add_trace(
2300 go.Heatmap(
2301 z=corr.loc[list_features, list_features].round(decimals).values,
(...)
2308 hovertemplate=hovertemplate,
2309 ), row=1, col=i+1)
File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2244, in SmartPlotter.correlations.<locals>.cluster_corr(corr, degree, inplace)
2241 return corr
2243 pairwise_distances = sch.distance.pdist(corr**degree)
-> 2244 linkage = sch.linkage(pairwise_distances, method='complete')
2245 cluster_distance_threshold = pairwise_distances.max()/2
2246 idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, criterion='distance')
File ~\Miniconda3\envs\pandas2\lib\site-packages\scipy\cluster\hierarchy.py:1064, in linkage(y, method, metric, optimal_ordering)
1061 raise ValueError("`y` must be 1 or 2 dimensional.")
1063 if not np.all(np.isfinite(y)):
-> 1064 raise ValueError("The condensed distance matrix must contain only "
1065 "finite values.")
1067 n = int(distance.num_obs_y(y))
1068 method_code = _LINKAGE_METHODS[method]
ValueError: The condensed distance matrix must contain only finite values.
Python version: 3.9.16
Shapash version: 2.3.5
Operating System: Windows 10
Thank you for reporting this bug; we'll fix it soon. Best regards.
Hi,
We have fixed this issue; you can try the new version, shapash 2.3.7.
Hello @ThomasBouche, thanks for working on the issue.
I am afraid the issue is still open. I have just run into the same problem with version 2.3.7.
I think I understand the problem. The pandas DataFrame received as corr contains NaNs, so pairwise_distances ends up containing only NaNs, which triggers the error.
Looking at the compute_corr function that generates the corr matrix, we can see that df.corr() produces NaNs due to the presence of constant columns: the standard deviation of a constant column is zero, which results in a division by zero in the correlation calculation.
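The mechanism described above can be reproduced outside shapash. The following is a minimal sketch, assuming plain pandas and SciPy; the toy DataFrame and its column names are made up for illustration, and the calls only mirror what the traceback shows (df.corr(), then pdist, then linkage), not shapash's exact code path.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical toy data: two varying columns and one constant column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "const": [5.0, 5.0, 5.0, 5.0],
})

# The constant column has zero standard deviation, so every correlation
# that involves it (including its own diagonal entry) is NaN.
corr = df.corr()
has_nan_corr = bool(corr.isna().any().any())

# Every row of corr now holds at least one NaN (in the "const" position),
# so every pairwise distance between rows is NaN as well.
distances = pdist(corr.values)
all_nan_distances = bool(np.isnan(distances).all())

# linkage rejects non-finite distances with a ValueError.
try:
    sch.linkage(distances, method="complete")
    linkage_error = None
except ValueError as exc:
    linkage_error = str(exc)

print(corr)
print(linkage_error)
```

Note that a single constant column is enough to make all distances NaN, because its NaN entry appears in every row of the correlation matrix.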
Hello, do you have an example so that I can reproduce the error? I tried to trigger it with constant values, but no error was raised.
Furthermore, in the context of a machine learning model, in what cases does a feature have constant values?
Hi! I think I've run into the same issue. It seems to be triggered quite easily when there are a lot of NaNs in the dataset. Are there any parameters I can set to skip this step?
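Independently of any shapash option, one way to check whether a dataset is likely to hit this is to look for columns that cannot yield a finite correlation. The sketch below assumes plain pandas; degenerate_columns is a hypothetical helper (not part of shapash), and the toy data is invented. It flags numeric columns with at most one distinct non-missing value, which covers both constant columns and all-NaN columns.

```python
import numpy as np
import pandas as pd

def degenerate_columns(df):
    """Return numeric columns with at most one distinct non-NaN value.

    Hypothetical helper for illustration: such columns have zero (or
    undefined) variance, so df.corr() returns NaN for them.
    """
    numeric = df.select_dtypes(include="number")
    return [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]

# Hypothetical toy data with a constant column and an all-NaN column.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, np.nan, 3.0],
    "const": [5.0, 5.0, 5.0, 5.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

bad = degenerate_columns(df)
clean_corr = df.drop(columns=bad).corr()

print(bad)
print(clean_corr.isna().any().any())
```

This is only a diagnostic. NaN correlations can still appear if two columns share too few rows where both are non-missing, and a column may be degenerate in only one of the train or test subsets, so it may be worth running the check on each subset separately.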