
LightGBMClassifier: issue with featuresShapCol for multiclass model

Open vsilaeva opened this issue 2 years ago • 2 comments

Hello, I am training a multiclass classification model whose target variable has 5 classes. Problem: when I request SHAP values, I get a single 1D vector per row. The code I am running on PySpark is:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.jars", "postgresql-42.4.2.jar") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.5-13-d1b51517-SNAPSHOT") \
    .getOrCreate()

# ->  spark version: 3.3.0

from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
from synapse.ml.lightgbm import LightGBMClassifier

# features     ==> list of features
# ohe_features ==> list of categorical features for one-hot encoding

ohe_reordered_features = [f'category_mapped_{it}' for it in ohe_features]
preOheEncoder = StringIndexer(inputCols=ohe_features, outputCols=ohe_reordered_features, handleInvalid='keep')

ohe_output_features = [f'ohe_out_{it}' for it in ohe_features]
oheEncoder = OneHotEncoder(inputCols=ohe_reordered_features, outputCols=ohe_output_features, dropLast=False)

vecAssembler = VectorAssembler(inputCols=[*features,  *ohe_output_features],  outputCol="features", handleInvalid="keep")

model = LightGBMClassifier(featuresCol='features', labelCol='target', rawPredictionCol='rawPrediction', 
                           featuresShapCol='featuresShap', probabilityCol='probability', weightCol='weight',
                           numLeaves=4, numIterations=25, 
                           objective='multiclass',
                           boostingType='gbdt'
                          )
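data_pipeline = Pipeline(stages=[preOheEncoder, oheEncoder, vecAssembler])
# A Pipeline is an Estimator: fit it first, then transform with the fitted PipelineModel
data_pipeline_model = data_pipeline.fit(train_data)
train_data = data_pipeline_model.transform(train_data)
test_data = data_pipeline_model.transform(test_data)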
model = model.fit(train_data)
preds = model.transform(test_data)

When I select the SHAP values, I get a single 1D array per row instead of 5 arrays, one per class (I have 5 classes in the target):

import numpy as np
from pyspark.ml.functions import vector_to_array

# Get SHAP values:
shap_values = preds.select("featuresShap")

# Convert the PySpark VectorUDT column to a NumPy array
shap_values_np = np.array(shap_values.select(vector_to_array("featuresShap").alias("featuresShap")).limit(100).collect())
print(shap_values_np.shape)
# ->  (100, 1, 7010)
# removing unnecessary axis
shap_values_np_3 = np.squeeze(shap_values_np, axis=1)
print(shap_values_np_3.shape)
# ->  (100, 7010)

Number of features in 'features' column:
# Select from preds: shap_values only holds the "featuresShap" column
shap_values_pd = (
    preds.select(vector_to_array("featuresShap").alias("featuresShap"),
                 vector_to_array("features").alias("features"))
    .limit(100)
    .toPandas()
)
print(len(shap_values_pd.features[0]))
# -> 1402

Why am I getting a 1D vector of SHAP values even though the classification is multiclass? Is it because of a bug in my code, or is this the expected behavior? If it is expected, how should such a 1D vector of SHAP values be interpreted for multiclass classification?
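One possible reading of the flat vector, sketched under an assumption rather than confirmed SynapseML behavior: LightGBM's native contribution output for multiclass models concatenates one block per class, and 7010 = 5 × 1402 is consistent with five equal blocks. In native LightGBM each block ends with an expected-value (bias) term, so a block sums to that class's raw score; comparing block sums with `rawPrediction` is one way to check the layout on real data. The helper names below are hypothetical and the arrays are synthetic stand-ins.

```python
import numpy as np


def split_by_class(flat_shap, num_classes):
    """Reshape (n_rows, num_classes * block_len) into (n_rows, num_classes, block_len).

    Assumes a class-major layout: all values for class 0, then class 1, and so on.
    """
    flat_shap = np.asarray(flat_shap, dtype=float)
    n_rows, total = flat_shap.shape
    if total % num_classes != 0:
        raise ValueError(f"{total} values cannot be split into {num_classes} equal blocks")
    return flat_shap.reshape(n_rows, num_classes, total // num_classes)


def layout_matches_raw_scores(flat_shap, raw_scores, atol=1e-6):
    """Return True if every per-class block sums to that class's raw score."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    blocks = split_by_class(flat_shap, raw_scores.shape[1])
    return bool(np.allclose(blocks.sum(axis=2), raw_scores, atol=atol))


# Synthetic stand-in with the shapes from the post: 100 rows, 5 classes, 7010 values.
rng = np.random.default_rng(0)
demo_blocks = rng.normal(size=(100, 5, 1402))
demo_flat = demo_blocks.reshape(100, 7010)
demo_raw = demo_blocks.sum(axis=2)  # plays the role of rawPrediction

print(split_by_class(demo_flat, 5).shape)
print(layout_matches_raw_scores(demo_flat, demo_raw))
```

On real data, `demo_flat` would be the squeezed SHAP array and `demo_raw` the `rawPrediction` column converted with `vector_to_array`. If the check fails, the layout assumption does not hold for that model and the blocks should not be read this way.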

vsilaeva, Oct 04 '22 23:10

Hey @vsilaeva :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot], Oct 04 '22 23:10

I have the same issue as described above. I would like to check whether there are any updates and to add my own code example. I have 9 features, including numerical features and one-hot encoded categorical features:

shap_feature_names = ['Age', 'Work_Experience', 'Family_Size', 'ohe_out_Gender', 'ohe_out_Ever_Married', 'ohe_out_Graduated', 'ohe_out_Profession', 'ohe_out_Spending_Score', 'ohe_out_Var_1']

And I have 4 labels: [image]

Originally I expected each row of the predicted validation dataset to have a list of 4 SHAP importance scores, one per label, so that the importances column (which holds the SHAP values) would have shape [total_data_rows, 4]. However, each row's SHAP score turns out to be a list of 140 values.

[image]

Next, I checked an example value from the features column. Although I only have 9 features, some of them have been one-hot encoded, so it makes sense that the value is a list of 34 values: [image]

I am very confused about how to get the mean SHAP feature importance for each of my 4 labels. Could you please add some documentation on how to determine multiclass feature importance when the features are a mixture of numerical and one-hot encoded categorical columns?

Thank you very much!

Haizhuolaojisite, Feb 12 '23 21:02
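A minimal sketch of one way to get per-label importances from the numbers above, assuming the layout of LightGBM's native contribution output: 140 = 4 × (34 + 1), that is, one block per label holding 34 slot contributions followed by one expected value. This layout is an assumption about what `featuresShapCol` returns, not something confirmed in this thread. The function names and the `slot_sizes` list are hypothetical; the real slot sizes would have to come from the fitted encoders.

```python
import numpy as np


def mean_abs_shap_per_class(flat_shap, num_classes):
    """Return a (num_classes, n_slots) array of mean |SHAP| over rows.

    Assumes class-major blocks of n_slots contributions plus one trailing
    expected value, which is dropped before averaging.
    """
    flat_shap = np.asarray(flat_shap, dtype=float)
    n_rows, total = flat_shap.shape
    if total % num_classes != 0:
        raise ValueError(f"{total} values cannot be split into {num_classes} equal blocks")
    blocks = flat_shap.reshape(n_rows, num_classes, total // num_classes)
    contributions = blocks[:, :, :-1]  # drop the assumed expected-value entry
    return np.abs(contributions).mean(axis=0)


def group_slots(per_slot, slot_sizes):
    """Sum slot-level importances into one value per original feature."""
    per_slot = np.asarray(per_slot, dtype=float)
    sizes = list(slot_sizes)
    if any(size <= 0 for size in sizes) or sum(sizes) != per_slot.shape[-1]:
        raise ValueError("slot_sizes must be positive and sum to the number of slots")
    starts = np.cumsum([0] + sizes[:-1])  # first slot index of each feature
    return np.add.reduceat(per_slot, starts, axis=-1)


# Synthetic stand-in: 50 rows, 4 labels, 34 slots + 1 expected value per label.
rng = np.random.default_rng(0)
flat = rng.normal(size=(50, 140))

# Hypothetical slot counts for the 9 features (3 numeric, 6 one-hot); they sum to 34.
slot_sizes = [1, 1, 1, 3, 3, 3, 10, 4, 8]

per_class = mean_abs_shap_per_class(flat, num_classes=4)
grouped = group_slots(per_class, slot_sizes)
print(per_class.shape, grouped.shape)
```

Summing mean absolute values across one-hot slots is only one convention. Summing the signed contributions of a feature's slots per row before taking absolute values is another, and the two can rank features differently.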