model-analysis
Error in merge_accumulators when using keras metrics on dataflow
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow Model Analysis): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): GCP Dataflow Apache Beam Python 3.7 SDK 2.39.0
- TensorFlow Model Analysis installed from (source or binary): binary
- TensorFlow Model Analysis version (use command below): 0.33
- Python version: 3.7
- Jupyter Notebook version: Jupyter lab 3.2.8
- Exact command to reproduce:
I am using TFX's Evaluator:

```python
eval_config = tfma.EvalConfig(
    model_specs=model_specs,
    metrics_specs=tfma.metrics.specs_from_metrics([
        tf.keras.metrics.AUC(curve='ROC', name='ROCAUC'),
        tf.keras.metrics.AUC(curve='PR', name='PRAUC'),
        tf.keras.metrics.Precision(),
        tf.keras.metrics.Recall(),
        tf.keras.metrics.BinaryAccuracy(),
    ]),
    slicing_specs=slicing_specs,
)

evaluator = Evaluator(
    eval_config=eval_config,
    model=model,
    examples=transform_examples,
)

context.run(evaluator)
```
Describe the problem
Running the same evaluation locally with Beam's DirectRunner does not cause any error, but whenever I run it on Dataflow and Dataflow spawns more than one worker, I get an error like this:
```
    output.with_value(self.phased_combine_fn.apply(output.value))
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 882, in merge_only
    return self.combine_fn.merge_accumulators(accumulators)
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in merge_accumulators
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in merge_accumulators
    for metric_index in range(len(self._metrics[output_name])):
TypeError: 'NoneType' object is not subscriptable
```
Based on the Dataflow log, the failing steps were:
- ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/Combine
- ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/GroupByKey
- ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PostCombineFn)/GroupByKey
I see that you have this commit, which appears to address this problem, but it was immediately rolled back. Have you run into similar issues, and what would you recommend to fix the error?
I tried setting Dataflow's max_num_workers to 1 and the job succeeded, so the problem does appear to be specific to running Dataflow with multiple workers.
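For reference, this is roughly how I capped the worker count. The flag names are the standard Beam/Dataflow pipeline options; the project, region, and bucket values below are placeholders, not my real ones:

```python
# Dataflow pipeline args for the single-worker workaround.
# Every value except max_num_workers is a placeholder.
beam_pipeline_args = [
    '--runner=DataflowRunner',
    '--project=my-gcp-project',            # placeholder
    '--region=us-central1',                # placeholder
    '--temp_location=gs://my-bucket/tmp',  # placeholder
    '--max_num_workers=1',  # one worker: no cross-worker accumulator merge
]
```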
Hi @zywind ,
As mentioned here, for distributed evaluation we use tfma.ExtractEvaluateAndWriteResults. Please refer to this example notebook and let me know if it resolves your issue.
Thank you.
Hi @singhniraj08,
I'm using the official TFX Evaluator, which uses tfma.ExtractEvaluateAndWriteResults internally, as you can see here.