ValueError: The condensed distance matrix must contain only finite values.

Open sungla55guy opened this issue 1 year ago • 5 comments

Hi, I'm using generate_report with an LGBMClassifier for binary classification. My data has categorical features and missing values, which LightGBM can handle natively. I'm able to get the dashboard to run; however, when I try to generate a report with the following code:

xpl.generate_report(
    output_file='report.html', 
    project_info_file='model.yml',
    x_train=X_train,
    y_train=y_train,
    y_test=y_test,
    title_story="CCA Default Risk",
    metrics=[
        {
            'path': 'sklearn.metrics.f1_score',
            'name': 'f1 score',
        },
        {
            'path': 'sklearn.metrics.balanced_accuracy_score',
            'name': 'Balanced Accuracy',
        },
        {
            'path': 'sklearn.metrics.roc_auc_score',
            'name': 'ROC AUC',
        }
    ]
)

I get the following error:

PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [8]":
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 report.display_dataset_analysis()

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\report\project_report.py:284, in ProjectReport.display_dataset_analysis(self, global_analysis, univariate_analysis, target_analysis, multivariate_analysis)
    282 if multivariate_analysis:
    283     print_md("### Multivariate analysis")
--> 284     fig_corr = self.explainer.plot.correlations(
    285         self.df_train_test,
    286         facet_col='data_train_test',
    287         max_features=20,
    288         width=900 if len(self.df_train_test['data_train_test'].unique()) > 1 else 500,
    289         height=500,
    290     )
    291     print_html(plotly.io.to_html(fig_corr))
    292 print_md('---')

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2296, in SmartPlotter.correlations(self, df, max_features, features_to_hide, facet_col, how, width, height, degree, decimals, file_name, auto_open)
   2294 if len(list_features) == 0:
   2295     top_features = compute_top_correlations_features(corr=corr, max_features=max_features)
-> 2296     corr = cluster_corr(corr.loc[top_features, top_features], degree=degree)
   2297     list_features = list(corr.columns)
   2299 fig.add_trace(
   2300     go.Heatmap(
   2301         z=corr.loc[list_features, list_features].round(decimals).values,
   (...)
   2308         hovertemplate=hovertemplate,
   2309     ), row=1, col=i+1)

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2244, in SmartPlotter.correlations.<locals>.cluster_corr(corr, degree, inplace)
   2241     return corr
   2243 pairwise_distances = sch.distance.pdist(corr**degree)
-> 2244 linkage = sch.linkage(pairwise_distances, method='complete')
   2245 cluster_distance_threshold = pairwise_distances.max()/2
   2246 idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, criterion='distance')

File ~\Miniconda3\envs\pandas2\lib\site-packages\scipy\cluster\hierarchy.py:1064, in linkage(y, method, metric, optimal_ordering)
   1061     raise ValueError("`y` must be 1 or 2 dimensional.")
   1063 if not np.all(np.isfinite(y)):
-> 1064     raise ValueError("The condensed distance matrix must contain only "
   1065                      "finite values.")
   1067 n = int(distance.num_obs_y(y))
   1068 method_code = _LINKAGE_METHODS[method]

ValueError: The condensed distance matrix must contain only finite values.
Python version: 3.9.16; Shapash version: 2.3.5; Operating System: Windows 10

sungla55guy, Aug 02 '23 12:08

Thank you for reporting this bug; we'll fix it soon. Best regards.

guillaume-vignal, Aug 22 '23 13:08

Hi,

We have fixed this issue. You can try the new version of shapash, 2.3.7.

ThomasBouche, Sep 20 '23 12:09

Hello @ThomasBouche, thanks for working on the issue.

I am afraid the issue is still open. I have just run into the same problem with version 2.3.7.

I think I understand the problem. The pandas DataFrame received as corr contains NaNs, so pairwise_distances ends up containing only NaNs, which triggers the error.

Looking at the compute_corr function that generates the corr matrix, we can see that df.corr() produces NaNs due to the presence of constant columns: the standard deviation of a constant column is zero, which causes a division by zero in the correlation calculation.
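The mechanism described above can be sketched outside of shapash. This is a minimal illustration, not the library's code: the toy DataFrame and its column names are made up, and it only assumes that the correlation matrix comes from pandas' df.corr() and is then passed through pdist and linkage as shown in the traceback.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Toy data (hypothetical): two random features and one constant feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "constant": np.ones(50),
})

# The constant column has zero standard deviation, so its whole
# row and column in the correlation matrix are NaN.
corr = df.corr()
nan_in_corr = bool(corr.isna().any().any())

# Every row of corr now contains a NaN, so every pairwise
# Euclidean distance between rows is NaN as well.
pairwise_distances = pdist(corr.to_numpy())

# linkage rejects non-finite condensed distance matrices.
try:
    sch.linkage(pairwise_distances, method="complete")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(nan_in_corr)
print(error_message)
```

A single NaN column is enough here, because each row of the correlation matrix has an entry in that column and pdist compares whole rows.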

ekamioka, Oct 12 '23 17:10

Hello, do you have an example so that I can reproduce the error? I tried to trigger it with constant values, but no error was raised.

Furthermore, in the context of a machine learning model, in what cases would a feature have constant values?

ThomasBouche, Oct 17 '23 11:10

Hi! I think I've run into the same issue. It seems to be triggered quite easily when there are a lot of NaNs in the dataset. Are there any parameters I can set to skip this step?
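Pending an answer on a parameter, one way to narrow things down is to find which columns make the correlation matrix non-finite. The helper below is a hypothetical diagnostic, not part of shapash: find_undefined_correlation_columns is an invented name, it only looks at numeric columns, and it assumes the matrix is computed with pandas' df.corr(). Whether handling the flagged columns before calling generate_report avoids the error has not been verified here.

```python
import numpy as np
import pandas as pd


def find_undefined_correlation_columns(df: pd.DataFrame) -> list:
    """Hypothetical helper: list numeric columns to set aside so that
    the remaining correlation matrix contains no NaN."""
    corr = df.select_dtypes(include="number").corr()
    offenders = []
    # Greedily remove the column involved in the most NaN correlations
    # until the remaining matrix is fully defined.
    while corr.isna().any().any():
        worst = corr.isna().sum().idxmax()
        offenders.append(worst)
        corr = corr.drop(index=worst, columns=worst)
    return offenders


# Toy data (made up): a constant column and an all-NaN column
# both yield undefined correlations.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(size=30),
    "b": rng.normal(size=30),
    "constant": np.zeros(30),
    "all_nan": np.full(30, np.nan),
})

offenders = find_undefined_correlation_columns(toy)
print(offenders)
```

Columns that are constant, entirely missing, or that share too few non-missing rows with other columns are the ones this flags, which matches both the constant-value and the many-NaNs cases reported in this thread.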

Augustlnx, Aug 14 '24 01:08