SynapseML
LightGBMClassifier: issue with featuresShapCol for multiclass model
Hello, I am training a multiclass classification model whose target variable has 5 classes. Problem: when I request SHAP values, I get a single flat (1D) vector per row. The code I am running on PySpark is:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.jars", "postgresql-42.4.2.jar") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.5-13-d1b51517-SNAPSHOT") \
    .getOrCreate()
# -> spark version: 3.3.0
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
from synapse.ml.lightgbm import LightGBMClassifier
# features: list of feature column names (defined elsewhere)
# ohe_features: list of categorical feature column names to one-hot encode (defined elsewhere)
ohe_reordered_features = [f'category_mapped_{it}' for it in ohe_features]
preOheEncoder = StringIndexer(inputCols=ohe_features, outputCols=ohe_reordered_features, handleInvalid='keep')
ohe_output_features = [f'ohe_out_{it}' for it in ohe_features]
oheEncoder = OneHotEncoder(inputCols=ohe_reordered_features, outputCols=ohe_output_features, dropLast=False)
vecAssembler = VectorAssembler(inputCols=[*features, *ohe_output_features], outputCol="features", handleInvalid="keep")
model = LightGBMClassifier(
    featuresCol='features', labelCol='target', rawPredictionCol='rawPrediction',
    featuresShapCol='featuresShap', probabilityCol='probability', weightCol='weight',
    numLeaves=4, numIterations=25,
    objective='multiclass',
    boostingType='gbdt'
)
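# Pipeline is an estimator: fit it on the training data first, then transform both splits
data_pipeline = Pipeline(stages=[preOheEncoder, oheEncoder, vecAssembler]).fit(train_data)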
train_data = data_pipeline.transform(train_data)
test_data = data_pipeline.transform(test_data)
model = model.fit(train_data)
preds = model.transform(test_data)
When I select the SHAP values, I get one flat array per row instead of a separate set of values for each of the 5 target classes:
import numpy as np
from pyspark.ml.functions import vector_to_array

# Get SHAP values:
shap_values = preds.select("featuresShap")
# convert the VectorUDT column to a NumPy array
shap_values_np = np.array(shap_values.select(vector_to_array("featuresShap").alias("featuresShap")).limit(100).collect())
print(shap_values_np.shape)
# -> (100, 1, 7010)
# removing unnecessary axis
shap_values_np_3 = np.squeeze(shap_values_np, axis=1)
print(shap_values_np_3.shape)
# -> (100, 7010)
Number of features in 'features' column:
# select from preds: shap_values only holds the featuresShap column
shap_values_pd = preds.select(
    vector_to_array("featuresShap").alias("featuresShap"),
    vector_to_array("features").alias("features"),
).limit(100).toPandas()
print(len(shap_values_pd.features[0]))
# -> 1402
Why am I getting a single 1D vector of SHAP values even though the classification is multiclass? Is this a bug in my code, or is it the expected behavior? If it is expected, how should such a 1D vector of SHAP values be interpreted for multiclass classification?
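One possible reading, offered as a sketch rather than a confirmed answer: native LightGBM returns per-row feature contributions for a multiclass model as one flat vector made of equal-sized blocks, one block per class. Assuming SynapseML's featuresShapCol keeps that class-by-class layout (this is an assumption, not something confirmed in this thread), the 7010 values per row reported above would split into 5 blocks of 1402. The helper name `split_shap_by_class` is hypothetical and not part of SynapseML.

```python
def split_shap_by_class(flat, num_classes):
    """Split one flat contribution vector into num_classes equal blocks.

    Assumption: the vector is laid out class by class (all values for
    class 0 first, then class 1, and so on), as native LightGBM does.
    """
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    if len(flat) % num_classes != 0:
        raise ValueError("vector length is not a multiple of num_classes")
    block = len(flat) // num_classes
    return [list(flat[k * block:(k + 1) * block]) for k in range(num_classes)]


# Stand-in for one row of featuresShap with the length reported in this issue.
row = [0.0] * 7010
blocks = split_shap_by_class(row, num_classes=5)
print(len(blocks), len(blocks[0]))  # 5 1402
```

In native LightGBM each per-class block also ends with an expected-value (bias) term, so it is worth checking on your own data whether the last entry of each block is constant across rows before treating it as a feature contribution.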
Hey @vsilaeva :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
I have the same issue as described above. I would like to check whether there are any updates, and to add my own code example:
I have 9 features, including numerical features and one-hot encoded categorical features:
shap_feature_names =['Age','Work_Experience','Family_Size','ohe_out_Gender', 'ohe_out_Ever_Married', 'ohe_out_Graduated', 'ohe_out_Profession', 'ohe_out_Spending_Score', 'ohe_out_Var_1']
And I have 4 labels:
Originally, I expected each row of the predicted validation dataset to have a list of 4 SHAP importance scores, one per label, so that the importances column (which holds the SHAP values) would have shape [total_data_rows, 4]. However, each row's SHAP vector turns out to be a list of 140 values.

Next, I checked an example value from the features column. Although I only have 9 features, some of them are one-hot encoded, so it makes sense that the value is a list of 34 values:
I am very confused about how to get the mean SHAP feature importance for each of my 4 labels. Could you please add some documentation on how to determine multiclass feature importance when the features are a mixture of numerical and one-hot encoded categorical columns?
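For the per-label importance asked about here, one common approach is to sum the contributions of each one-hot feature's indicator slots and then average the absolute value over rows, separately for each class. The sketch below rests on three assumptions that are not confirmed in this thread: the vector is laid out class by class; each class block holds one entry per assembled feature slot followed by one expected-value (bias) term, which would fit 140 = 4 × (34 + 1); and the slots follow the VectorAssembler input order. The function name `mean_abs_shap_by_feature` and the `group_sizes` mapping are hypothetical, and the real slot counts have to come from your own encoder.

```python
def mean_abs_shap_by_feature(rows, num_classes, group_sizes):
    """Mean absolute contribution per class and per original feature.

    rows:        list of flat contribution vectors, one per data row
    num_classes: number of target classes
    group_sizes: dict mapping original feature name -> number of vector
                 slots it occupies (1 for numeric, the one-hot width for
                 categorical), in assembler order (dicts keep insertion
                 order in Python 3.7+)
    """
    if not rows:
        raise ValueError("rows must not be empty")
    num_slots = sum(group_sizes.values())
    block = num_slots + 1  # assumed: one trailing expected-value term per class
    totals = [{name: 0.0 for name in group_sizes} for _ in range(num_classes)]
    for flat in rows:
        if len(flat) != num_classes * block:
            raise ValueError("unexpected vector length for the assumed layout")
        for k in range(num_classes):
            start = k * block
            for name, size in group_sizes.items():
                # Contributions are additive, so a one-hot feature's
                # contribution is the sum over its indicator slots.
                contribution = sum(flat[start:start + size])
                totals[k][name] += abs(contribution)
                start += size
            # The entry at index `start` is now the assumed bias term; skip it.
    n = len(rows)
    return [{name: total / n for name, total in per_class.items()}
            for per_class in totals]


# Toy data: 2 classes, one numeric feature and one 2-slot one-hot feature.
toy_rows = [
    [1.0, 0.5, -0.25, 9.0, -2.0, 1.0, 1.0, 9.0],
    [-1.0, 0.5, 0.5, 9.0, 2.0, 0.0, 0.0, 9.0],
]
importance = mean_abs_shap_by_feature(
    toy_rows, num_classes=2, group_sizes={"Age": 1, "ohe_out_Gender": 2}
)
print(importance[0])  # {'Age': 1.0, 'ohe_out_Gender': 0.625}
```

For the case described above, `num_classes` would be 4 and the values in `group_sizes` would have to sum to 34. Summing signed contributions before taking the absolute value is a design choice; averaging the absolute value of every slot individually would give a different, slot-level ranking.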
Thank you very much!