SynapseML
[BUG] Error when using custom Transformer with TabularSHAP in SynapseML
SynapseML version
0.10.1
System information
- Language version (e.g. python 3.8, scala 2.12): python 3.10
- Spark Version (e.g. 3.2.3): 3.3.1
- Spark Platform (e.g. Synapse, Databricks): Amazon EKS
Describe the problem
Hello,
I encountered an issue when using the TabularSHAP module in SynapseML with a custom Transformer. I received the following error message: AttributeError: 'SimpleTransformer' object has no attribute '_to_java'.
I believe this may be caused either by a bug in the TabularSHAP implementation or by an incomplete implementation of my custom Transformer. Could you please help me determine whether this issue is due to a bug in TabularSHAP or a problem with my custom Transformer? If it is the latter, any suggestions for improving my implementation would be greatly appreciated.
Thank you in advance for your assistance.
Code to reproduce issue
from pyspark import keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMClassifier


class SimpleTransformer(
    Transformer,
    HasInputCol,
    HasOutputCol,
    DefaultParamsReadable,
    DefaultParamsWritable,
):
    # inputCol/outputCol are also provided by HasInputCol/HasOutputCol;
    # redefining them here overrides the inherited Param objects.
    inputCol = Param(
        Params._dummy(),
        "inputCol",
        "inputCol",
    )
    outputCol = Param(
        Params._dummy(),
        "outputCol",
        "outputCol",
    )
    num = Param(
        Params._dummy(),
        "num",
        "the value to add to the input column",
    )

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, num=0):
        super().__init__()
        self._setDefault(num=0)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, num=0):
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getNum(self):
        return self.getOrDefault(self.num)

    def _transform(self, dataset):
        if not self.isSet("inputCol"):
            raise ValueError("inputCol is not set")
        input_column = self.getInputCol()
        output_column = self.getOutputCol()
        num = self.getNum()
        return dataset.withColumn(output_column, F.col(input_column) + num)
sdf = spark.createDataFrame(
    [
        ['iD-01', 1, 1, 'a', 4],
        ['iD-02', 2, 2, 'b', 3],
        ['iD-03', 3, 3, 'c', 4],
        ['iD-04', 0, 0, 'b', 1],
        *[[f'iD-SAMPLE{i}-label1', 1, 1, 'a', 4] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label2', 2, 2, 'b', 3] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label3', 3, 3, 'c', 4] for i in range(100)],
        *[[f'iD-SAMPLE{i}-label0', 0, 0, 'b', 1] for i in range(100)],
    ],
    schema=['ID', 'colA', 'colB', 'colC', 'colD'],
)
si = StringIndexer(inputCol='colC', outputCol='featured_colC')
st = SimpleTransformer(inputCol="colB", outputCol="newColB", num=1)
va = VectorAssembler(
    inputCols=['newColB', 'featured_colC', 'colD'], outputCol='features'
)
model = LightGBMClassifier(
    objective="multiclass",
    featuresCol="features",
    labelCol="colA",
    numTasks=3,
    useBarrierExecutionMode=True,
    categoricalSlotIndexes=[1],
    categoricalSlotNames=['featured_colC'],
)
pipeline = Pipeline(stages=[si, st, va, model])
model = pipeline.fit(sdf)
explain_instances = model.transform(sdf)
from pyspark.sql.functions import broadcast, rand
from synapse.ml.explainers import TabularSHAP

shap = TabularSHAP(
    inputCols=["colB", "colC", "colD"],
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1, 2, 3],
    backgroundData=broadcast(sdf.orderBy(rand()).limit(100).cache()),
)

# This call raises: AttributeError: 'SimpleTransformer' object has no attribute '_to_java'
shap_df = shap.transform(explain_instances)
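For reference, the pipeline stage that has no JVM counterpart can be spotted with a quick check (a diagnostic sketch on top of the fitted PipelineModel `model` from the script above; not part of the original repro):

# List each fitted stage and whether it exposes _to_java (i.e. is JVM-backed).
# Only the SimpleTransformer stage reports False here.
for stage in model.stages:
    print(type(stage).__name__, hasattr(stage, "_to_java"))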
Other info / logs
AttributeError: 'SimpleTransformer' object has no attribute '_to_java'
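For completeness, a workaround I am considering (an untested sketch; it assumes TabularSHAP only needs the model it receives to be JVM-backed, and it explains newColB instead of colB because the Python-only stage is applied up front). The names sdf_pre, lgbm, jvm_only_model, and shap_workaround are new and only used for this sketch:

# Untested workaround sketch: apply the Python-only SimpleTransformer up front,
# then hand TabularSHAP a fitted pipeline made only of JVM-backed stages.
sdf_pre = st.transform(sdf)  # materialize newColB = colB + num

lgbm = LightGBMClassifier(
    objective="multiclass",
    featuresCol="features",
    labelCol="colA",
    numTasks=3,
    useBarrierExecutionMode=True,
    categoricalSlotIndexes=[1],
    categoricalSlotNames=['featured_colC'],
)
jvm_only_model = Pipeline(stages=[si, va, lgbm]).fit(sdf_pre)

shap_workaround = TabularSHAP(
    inputCols=["newColB", "colC", "colD"],  # newColB replaces colB
    outputCol="shapValues",
    numSamples=5000,
    model=jvm_only_model,
    targetCol="probability",
    targetClasses=[1, 2, 3],
    backgroundData=broadcast(sdf_pre.orderBy(rand()).limit(100).cache()),
)
shap_df = shap_workaround.transform(jvm_only_model.transform(sdf_pre))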
What component(s) does this bug affect?
- [ ] area/cognitive: Cognitive project
- [X] area/core: Core project
- [ ] area/deep-learning: DeepLearning project
- [X] area/lightgbm: Lightgbm project
- [ ] area/opencv: Opencv project
- [ ] area/vw: VW project
- [ ] area/website: Website
- [ ] area/build: Project build system
- [ ] area/notebooks: Samples under notebooks folder
- [ ] area/docker: Docker usage
- [ ] area/models: models related issue
What language(s) does this bug affect?
- [ ] language/scala: Scala source code
- [ ] language/python: Pyspark APIs
- [ ] language/r: R APIs
- [ ] language/csharp: .NET APIs
- [ ] language/new: Proposals for new client languages
What integration(s) does this bug affect?
- [ ] integrations/synapse: Azure Synapse integrations
- [ ] integrations/azureml: Azure ML integrations
- [ ] integrations/databricks: Databricks integrations
Hey @TakuyaInoue-github :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.
Thanks for reporting this, @TakuyaInoue-github. Looks like @memoryz is already on the case.
@TakuyaInoue-github can you show me the full error stack? I want to understand where this error is coming from.
@memoryz Sure, the following is the stack trace of the error.
Traceback (most recent call last):
File "/mnt/share/example/shap_example_not_working.py", line 172, in <module>
shap_df = shap.transform(explain_instances)
File "/opt/spark/python/pyspark/ml/base.py", line 217, in transform
return self._transform(dataset)
File "/opt/spark/python/pyspark/ml/wrapper.py", line 349, in _transform
self._transfer_params_to_java()
File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/schema/Utils.py", line 131, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._paramMap[param])
File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 88, in _mml_make_java_param_pair
java_value = _mml_py2java(sc, value)
File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 60, in _mml_py2java
obj = obj._to_java()
File "/opt/spark/python/pyspark/ml/pipeline.py", line 333, in _to_java
java_stages[idx] = stage._to_java()
AttributeError: 'SimpleTransformer' object has no attribute '_to_java'
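From the trace, the failure happens while the PipelineModel passed as the model param is converted to its JVM counterpart. The same error can be reproduced outside of TabularSHAP (illustrative only, using the fitted model from the script above):

# PipelineModel._to_java() converts every stage to a JVM object, so the
# pure-Python SimpleTransformer stage raises the same AttributeError.
model._to_java()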
I hope you find it useful. Thank you.
Hi, is there any update regarding this? I encountered the same error when trying to calculate SHAP values for a SparkXGBClassifier model. Thank you in advance for the information.
Hi, I'm having the same problem with SparkXGBClassifier! Any updates?