Issue in DRF Obtaining SHAP and Probability values

Open hemanshupaliwa7 opened this issue 4 years ago • 6 comments

Hello @jakubhava

I am using Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4.

The platform is Ubuntu 16.04, with sbt 0.13.15 and Scala 2.11.11.

I can get SHAP values and probabilities from the other models quickly, whereas DRF, either standalone or in AutoML (with only DRF included), takes a long time to return results.

Is there some issue with DRF's detailed_prediction?

Regards, Hemanshu

hemanshupaliwa7 avatar Sep 21 '20 15:09 hemanshupaliwa7

cc @michalkurka @honzasterba @Pscheidl

mn-mikke avatar Sep 21 '20 16:09 mn-mikke

Shapley is expensive to calculate, can you provide specifics about your model parameters, dataset, and the runtimes?
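As a rough illustration of why the cost can differ so much between models: TreeSHAP does work on the order of T * L * D^2 per row, where T is the number of trees, L the maximum number of leaves per tree, and D the maximum depth. The sketch below only evaluates that bound; the function name `treeshap_cost` and all numeric values are hypothetical and chosen for illustration, not taken from this issue.

```python
def treeshap_cost(n_trees: int, max_depth: int, n_train_rows: int) -> int:
    """Worst-case per-row TreeSHAP work estimate, O(T * L * D^2)."""
    # A binary tree of depth D has at most 2^D leaves, and it cannot have
    # more leaves than there are training rows.
    max_leaves = min(2 ** max_depth, n_train_rows)
    return n_trees * max_leaves * max_depth ** 2


# Illustrative values only (assumptions, not measurements from this thread).
shallow = treeshap_cost(n_trees=50, max_depth=6, n_train_rows=100_000)
deep = treeshap_cost(n_trees=50, max_depth=20, n_train_rows=100_000)
print(shallow, deep, deep / shallow)
```

This is an upper bound rather than a runtime prediction, but it suggests that a forest of deep trees can be orders of magnitude more expensive to explain than the same number of shallow trees.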

michalkurka avatar Sep 21 '20 16:09 michalkurka

Hello @michalkurka ,

Here are the parameters, dataset, and runtimes for XGBoost and DRF (in AutoML).

Parameters image image

Model Trained Dataset distribution: image

Test Prediction Dataset: image

Runtimes for XGBoost: here you can see the time taken to generate the CrossTab of label vs. prediction and to expand the probabilities and SHAP values. image

Runtimes for AutoML (DRF): here you can see the time taken to generate the CrossTab of label vs. prediction and to expand the probabilities and SHAP values. image

The confusion matrix output looks like this: image

I am using mostly default parameters in both algorithms. Categorical encoding in DRF is auto, which resolves to enum, so the total number of features is 119. For XGBoost the encoding is one-hot, so the total number of features expands to 390.
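To make the feature-count difference concrete, here is a minimal sketch of how the two encodings count features. The helper `feature_counts` and the column cardinalities are hypothetical (they are not the actual dataset), and the sketch ignores any extra level an implementation may add for missing values.

```python
def feature_counts(n_numeric, cardinalities):
    """Return (enum_features, one_hot_features) for a table.

    n_numeric: number of numeric columns.
    cardinalities: number of distinct levels in each categorical column.
    """
    # Enum encoding keeps each categorical column as a single feature.
    enum_features = n_numeric + len(cardinalities)
    # One-hot encoding creates one feature per level of each categorical column.
    one_hot_features = n_numeric + sum(cardinalities)
    return enum_features, one_hot_features


# Hypothetical table: 4 numeric columns and 3 categoricals with 3, 5, and 10 levels.
enum_n, one_hot_n = feature_counts(n_numeric=4, cardinalities=[3, 5, 10])
print(enum_n, one_hot_n)  # 7 22
```

The same arithmetic is why a dataset with 119 columns can grow to several hundred features once its categorical columns are one-hot encoded.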

Logic used to expand the probabilities and SHAP values: image

The issue is not limited to Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4. I had faced issues with standalone DRF in 3.28.x as well, but AutoML (DRF) was working fine in 3.28.x.

Hope it helps.

hemanshupaliwa7 avatar Sep 22 '20 04:09 hemanshupaliwa7

Just a thought: perhaps there is something wrong with the transform method, model.transform(dataframe).

hemanshupaliwa7 avatar Sep 22 '20 05:09 hemanshupaliwa7

A few more runtime details on XGBoost, with a larger data volume than the one above. These two jobs: image

provide this output: image

Here I am extracting only the probabilities, not the SHAP values, and it does not take much time for the XGBoost-trained model to transform the dataframe. That is not the case for AutoML (DRF) or standalone DRF, even with the much smaller amount of data shown above.

hemanshupaliwa7 avatar Sep 22 '20 06:09 hemanshupaliwa7

@michalkurka any luck?

hemanshupaliwa7 avatar Sep 24 '20 06:09 hemanshupaliwa7

Closing this one because there has been no activity for a long time.

krasinski avatar Dec 05 '22 17:12 krasinski