sparkling-water
Issue in DRF: Obtaining SHAP and Probability values
Hello @jakubhava
I am using Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4.
Platform is Ubuntu 16.04, sbt 0.13.15, scala 2.11.11
I am able to get SHAP values and probabilities from the other models quickly, whereas DRF, either standalone or in AutoML (including only DRF), takes a long time to return results.
Is there some issue with DRF detailed_prediction?
Regards, Hemanshu
cc @michalkurka @honzasterba @Pscheidl
Shapley is expensive to calculate, can you provide specifics about your model parameters, dataset, and the runtimes?
Hello @michalkurka ,
I am trying to show parameters, dataset and runtimes for XGBoost and DRF (in AutoML).
Parameters
Model Trained Dataset distribution:
Test Prediction Dataset:
Runtimes for XGBoost:
Here you can see the time taken to generate the CrossTab of label vs prediction, and to expand the probabilities and SHAP values.
Runtimes for AutoML (DRF):
Here you can see the time taken to generate the CrossTab of label vs prediction, and to expand the probabilities and SHAP values.
Confusion matrix output looks like this:
I am using mostly default parameters in both algorithms. Categorical encoding in DRF is auto, which resolves to enum, so the total number of features is 119. For XGBoost the encoding is one-hot, so the number of features expands to 390.
Logic to expand probabilities and SHAP values:
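A minimal sketch of one way such an expansion could be written, not necessarily the code used in this report. It assumes the scored frame has a `detailed_prediction` column whose `probabilities` field is keyed by class label and whose `contributions` field is keyed by feature name; those field names, the `ExpandDetails` object and the `p_` / `shap_` prefixes are all illustrative assumptions.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper. The layout of `detailed_prediction` is an assumption,
// so check scored.printSchema() before relying on these field names.
object ExpandDetails {

  // One flat column per key found under detailed_prediction.<field>.
  // getItem looks up a map key or a struct field by name.
  def flatten(field: String, keys: Seq[String], prefix: String): Seq[Column] =
    keys.map { k =>
      col(s"detailed_prediction.$field").getItem(k).as(s"$prefix$k")
    }

  // Keeps all original columns and appends the flattened ones.
  def expand(scored: DataFrame,
             classes: Seq[String],
             features: Seq[String]): DataFrame = {
    val flat = flatten("probabilities", classes, "p_") ++
      flatten("contributions", features, "shap_")
    scored.select((col("*") +: flat): _*)
  }
}
```

With 119 features this adds 119 contribution columns per row, so the width of the projection alone is worth keeping in mind when comparing runtimes.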
The issue is not limited to Spark 2.4.6 with Sparkling Water 3.30.1.2-1-2.4; I had faced issues with standalone DRF in 3.28.x as well. But AutoML (DRF) was working fine in 3.28.x.
Hope it helps.
Just a thought: perhaps there is something wrong with the transform method, model.transform(dataframe).
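One way to check that hunch would be to time the transform step on its own. The sketch below is a plain Scala timing helper. The name `timed` is hypothetical, and the commented usage assumes a trained `model` and a `dataframe` as in the line above.

```scala
object Timing {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in seconds.
  def timed[A](label: String)(block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label%s: $seconds%.3f s")
    (result, seconds)
  }
}

// Usage (assumes a Spark session, a trained `model` and a `dataframe`).
// Spark is lazy, so an action that touches every row is needed to make
// the transform actually run:
// val (_, secs) = Timing.timed("transform") {
//   model.transform(dataframe).rdd.foreach(_ => ())
// }
```

Timing the same action with and without the detailed prediction columns selected would separate the cost of scoring from the cost of computing contributions.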
A few more runtime details on XGBoost, with a larger data volume than the one above.
These two jobs
provide output as:
Here I am only extracting the probabilities, not the SHAP values, and it does not take much time for the XGBoost-trained model to transform the dataframe. That is not the case for AutoML (DRF) or standalone DRF, even with the smaller amount of data shown above.
@michalkurka any luck?
Closing this one because there has been no activity for a long time.