
[BUG] Error when using custom Transformer with TabularSHAP in SynapseML

Open TakuyaInoue-github opened this issue 2 years ago • 6 comments

SynapseML version

0.10.1

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.10
  • Spark Version (e.g. 3.2.3): 3.3.1
  • Spark Platform (e.g. Synapse, Databricks): Amazon EKS

Describe the problem

Hello,

I encountered an issue when using the TabularSHAP module in SynapseML with a custom Transformer. I received the following error message: AttributeError: 'SimpleTransformer' object has no attribute '_to_java'.

I believe this issue may be caused by either a bug in the TabularSHAP implementation or an insufficient implementation of my custom Transformer. Could you please help me determine which it is? If it is the latter, any suggestions for improving my implementation would be greatly appreciated.

Thank you in advance for your assistance.

Code to reproduce issue

from pyspark import keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMClassifier


class SimpleTransformer(
    Transformer,
    HasInputCol,
    HasOutputCol,
    DefaultParamsReadable,
    DefaultParamsWritable,
):
    # Note: HasInputCol/HasOutputCol already define inputCol/outputCol Params;
    # redefining them here shadows the mixin versions but is otherwise harmless.
    inputCol = Param(
        Params._dummy(),
        "inputCol",
        "inputCol",
    )
    outputCol = Param(
        Params._dummy(),
        "outputCol",
        "outputCol",
    )
    num = Param(
        Params._dummy(),
        "num",
        "the number to add to the input column",
    )

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, num=0):
        super().__init__()
        self._setDefault(num=0)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, num=0):
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getNum(self):
        return self.getOrDefault(self.num)

    def _transform(self, dataset):
        if not self.isSet("inputCol"):
            raise ValueError("inputCol is not set")

        input_columns = self.getInputCol()
        output_column = self.getOutputCol()
        num = self.getNum()

        return dataset.withColumn(output_column, F.col(input_columns) + num)


sdf = spark.createDataFrame(
    [
        ['iD-01', 1, 1, 'a', 4],
        ['iD-02', 2, 2, 'b', 3],
        ['iD-03', 3, 3, 'c', 4],
        ['iD-04', 0, 0, 'b', 1],
        *[[f'iD-SAMPLE{i}-label1', 1, 1, 'a', 4] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label2', 2, 2, 'b', 3] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label3', 3, 3, 'c', 4] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label0', 0, 0, 'b', 1] for i in range(100)],
    ],
    schema=['ID', 'colA', 'colB', 'colC', 'colD'],
)

si = StringIndexer(inputCol='colC', outputCol='featured_colC')
st = SimpleTransformer(inputCol="colB", outputCol="newColB", num=1)
va = VectorAssembler(
    inputCols=['newColB', 'featured_colC', 'colD'], outputCol='features'
)

model = LightGBMClassifier(
    objective="multiclass",
    featuresCol="features",
    labelCol="colA",
    numTasks=3,
    useBarrierExecutionMode=True,
    categoricalSlotIndexes=[1],
    categoricalSlotNames=['featured_colC'],
)

pipeline = Pipeline(stages=[si, st, va, model])
model = pipeline.fit(sdf)

explain_instances = model.transform(sdf)

from pyspark.sql.functions import broadcast, rand
from synapse.ml.explainers import TabularSHAP

shap = TabularSHAP(
    inputCols=["colB", "colC", "colD"],
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1, 2, 3],
    backgroundData=broadcast(sdf.orderBy(rand()).limit(100).cache()),
)

# This raises: AttributeError: 'SimpleTransformer' object has no attribute '_to_java'
shap_df = shap.transform(explain_instances)
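A possible workaround (an assumption on my part, not confirmed by the SynapseML maintainers): since the failure comes from transferring a pure-Python stage to the JVM, replacing SimpleTransformer with the Java-backed pyspark.ml.feature.SQLTransformer keeps every pipeline stage `_to_java`-capable. A minimal sketch of the equivalent statement:

```python
# Hypothetical workaround sketch: SimpleTransformer only adds `num` to a
# column, which SQLTransformer (a JVM-backed stage that implements _to_java)
# can express directly. `__THIS__` is SQLTransformer's placeholder for the
# input DataFrame.
num = 1
statement = f"SELECT *, colB + {num} AS newColB FROM __THIS__"
print(statement)

# Constructing the stage requires an active SparkSession, so shown unexecuted:
# from pyspark.ml.feature import SQLTransformer
# st = SQLTransformer(statement=statement)
# pipeline = Pipeline(stages=[si, st, va, model])
```

With this substitution the fitted PipelineModel contains only JVM-backed stages, so the `_to_java` transfer performed for the `model` param should succeed.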

Other info / logs

AttributeError: 'SimpleTransformer' object has no attribute '_to_java'

What component(s) does this bug affect?

  • [ ] area/cognitive: Cognitive project
  • [X] area/core: Core project
  • [ ] area/deep-learning: DeepLearning project
  • [X] area/lightgbm: Lightgbm project
  • [ ] area/opencv: Opencv project
  • [ ] area/vw: VW project
  • [ ] area/website: Website
  • [ ] area/build: Project build system
  • [ ] area/notebooks: Samples under notebooks folder
  • [ ] area/docker: Docker usage
  • [ ] area/models: models related issue

What language(s) does this bug affect?

  • [ ] language/scala: Scala source code
  • [ ] language/python: Pyspark APIs
  • [ ] language/r: R APIs
  • [ ] language/csharp: .NET APIs
  • [ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • [ ] integrations/synapse: Azure Synapse integrations
  • [ ] integrations/azureml: Azure ML integrations
  • [ ] integrations/databricks: Databricks integrations

TakuyaInoue-github avatar Apr 25 '23 04:04 TakuyaInoue-github

Hey @TakuyaInoue-github :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

github-actions[bot] avatar Apr 25 '23 04:04 github-actions[bot]

Thanks for reporting this @TakuyaInoue-github . Looks like @memoryz is already on the case

mhamilton723 avatar May 01 '23 17:05 mhamilton723

@TakuyaInoue-github can you show me the full error stack? I want to understand where this error is coming from.

memoryz avatar May 02 '23 07:05 memoryz

@memoryz Sure, the following is the stack trace of the error.

Traceback (most recent call last):
  File "/mnt/share/example/shap_example_not_working.py", line 172, in <module>
    shap_df = shap.transform(explain_instances)
  File "/opt/spark/python/pyspark/ml/base.py", line 217, in transform
    return self._transform(dataset)
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 349, in _transform
    self._transfer_params_to_java()
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/schema/Utils.py", line 131, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._paramMap[param])
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 88, in _mml_make_java_param_pair
    java_value = _mml_py2java(sc, value)
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 60, in _mml_py2java
    obj = obj._to_java()
  File "/opt/spark/python/pyspark/ml/pipeline.py", line 333, in _to_java
    java_stages[idx] = stage._to_java()
AttributeError: 'SimpleTransformer' object has no attribute '_to_java'

I hope you find it useful. Thank you.

TakuyaInoue-github avatar May 09 '23 04:05 TakuyaInoue-github
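The trace above can be reduced to a small sketch. This is a simplified model with hypothetical names (`py2java` and the stage classes), not the actual SynapseML `_mml_py2java` code: the explainer transfers its `model` param to the JVM by calling `_to_java`, a PipelineModel calls `_to_java` on each stage in turn, and a pure-Python Transformer (even with DefaultParamsReadable/Writable) never defines that method.

```python
# Simplified, hypothetical sketch of why the AttributeError occurs.

class JavaBackedStage:
    """Stands in for a JVM-wrapped stage such as StringIndexerModel."""
    def _to_java(self):
        return "<jvm handle>"

class PurePythonStage:
    """Stands in for SimpleTransformer: no _to_java is ever defined."""

def py2java(stage):
    # Mirrors the dispatch in the traceback: plain attribute lookup,
    # so a stage without _to_java raises AttributeError here.
    return stage._to_java()

py2java(JavaBackedStage())       # succeeds
try:
    py2java(PurePythonStage())   # raises AttributeError, as in the report
except AttributeError as e:
    print(e)
```

This suggests the limitation is structural: any param transferred to the JVM must be Java-backed, which is why the same error appears below for SparkXGBClassifier stages as well.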

Hi, is there any update regarding this? I encountered the same error when trying to calculate SHAPs for a SparkXGBClassifier model. Thank you in advance for the information.

kappanful avatar Jul 18 '23 09:07 kappanful

Hi, I'm having the same problem with SparkXGBClassifier! Any updates?

AlejandroGVC avatar Nov 08 '23 15:11 AlejandroGVC