sparkling-water: Issue in DRF Obtaining SHAP and Probability values

Hello @jakubhava

I am using Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4.

Platform is Ubuntu 16.04, sbt 0.13.15, Scala 2.11.11.

I am able to get SHAP values and probabilities from the other models quickly, whereas DRF, either standalone or in AutoML (restricted to DRF only), takes a long time to return results.

Is there some issue with DRF detailed_prediction?

Regards, Hemanshu

Sep 21 '20 15:09 hemanshupaliwa7

cc @michalkurka @honzasterba @Pscheidl

Sep 21 '20 16:09 mn-mikke

Shapley is expensive to calculate, can you provide specifics about your model parameters, dataset, and the runtimes?

Sep 21 '20 16:09 michalkurka

Hello @michalkurka ,

Below are the parameters, dataset, and runtimes for XGBoost and DRF (in AutoML).

Parameters

Model Trained Dataset distribution:

Test Prediction Dataset:

Runtimes for XGBoost: Here you can see the time taken to generate the CrossTab of label vs prediction, and to expand probabilities and SHAP values.

Runtimes for AutoML (DRF): Here you can see the time taken to generate the CrossTab of label vs prediction, and to expand probabilities and SHAP values.

Confusion Matrix output looks like this:

I am using mostly default parameters in both algorithms. Categorical encoding in DRF is auto, which resolves to enum, so the total number of features/variables is 119. In XGBoost the encoding is One Hot Encoding, so the total number of features/variables expands to 390.
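To illustrate why the same dataset can show 119 features under enum encoding and 390 under one-hot encoding, here is a minimal plain-Python sketch. The schema in it (100 numeric columns and three categorical columns with the listed cardinalities) is hypothetical and is not the dataset from this issue; the helper `feature_count` is also made up for illustration. Real implementations may add an extra level for missing values, which is ignored here.

```python
def feature_count(numeric_cols, categorical_cardinalities, encoding):
    """Return the number of model input columns under the given encoding."""
    if encoding == "enum":
        # Each categorical column stays a single feature.
        return numeric_cols + len(categorical_cardinalities)
    if encoding == "one_hot":
        # Each categorical column becomes one indicator column per level.
        return numeric_cols + sum(categorical_cardinalities)
    raise ValueError(f"unknown encoding: {encoding}")


# Hypothetical schema: 100 numeric columns and 3 categorical columns.
cardinalities = [4, 10, 25]
enum_count = feature_count(100, cardinalities, "enum")
one_hot_count = feature_count(100, cardinalities, "one_hot")
print(enum_count, one_hot_count)
```

The two counts differ only by the number of categorical levels, so a gap such as 119 versus 390 mostly reflects how many distinct levels the categorical columns have.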

Logic used to expand probabilities and SHAP values:

The issue is not limited to Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4. I had faced issues with standalone DRF in 3.28.x as well, but AutoML(DRF) was working fine in 3.28.x.

Hope it helps.

Sep 22 '20 04:09 hemanshupaliwa7

Just a thought: perhaps the issue is in the transform method, model.transform(dataframe).
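One caveat when timing this step: Spark transformations are lazy, so a call like model.transform(dataframe) only builds a plan, and the scoring cost shows up in whichever action runs afterwards. A minimal sketch of a timing helper that wraps the action itself is below. The helper name `timed` is hypothetical, and the lambda is a stand-in workload so the snippet runs without Spark.

```python
import time


def timed(label, action):
    """Run `action` (a zero-argument callable) and return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


# Stand-in for a Spark action; replace with the real action to be measured.
result, seconds = timed("stand-in action", lambda: sum(range(1_000_000)))
```

With a real DataFrame, the callable would wrap an action that actually consumes the prediction columns, such as writing the scored output. A bare row count may not evaluate columns it does not need, so it can understate the scoring time.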

Sep 22 '20 05:09 hemanshupaliwa7

A few more runtime details on XGBoost, with more data volume than in the example above. These two jobs

provide output as:

Here I am extracting only the probabilities, not the SHAP values, and the XGBoost-trained model does not take much time to transform the dataframe. That is not the case for AutoML(DRF)/standalone DRF, even with the smaller amount of data shown above.

Sep 22 '20 06:09 hemanshupaliwa7

@michalkurka any luck?

Sep 24 '20 06:09 hemanshupaliwa7

Closing this one because there has been no activity for a long time.

Dec 05 '22 17:12 krasinski