
Error in merge_accumulators when using keras metrics on dataflow

Open zywind opened this issue 2 years ago • 3 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow Model Analysis): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): GCP Dataflow Apache Beam Python 3.7 SDK 2.39.0
  • TensorFlow Model Analysis installed from (source or binary): binary
  • TensorFlow Model Analysis version (use command below): 0.33
  • Python version: 3.7
  • Jupyter Notebook version: Jupyter lab 3.2.8
  • Exact command to reproduce:

I am using TFX's Evaluator component:

import tensorflow as tf
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# model_specs, slicing_specs, model, transform_examples, and context
# are defined elsewhere (not shown).
eval_config = tfma.EvalConfig(
    model_specs=model_specs,
    metrics_specs=tfma.metrics.specs_from_metrics([
        tf.keras.metrics.AUC(curve='ROC', name='ROCAUC'),
        tf.keras.metrics.AUC(curve='PR', name='PRAUC'),
        tf.keras.metrics.Precision(),
        tf.keras.metrics.Recall(),
        tf.keras.metrics.BinaryAccuracy(),
    ]),
    slicing_specs=slicing_specs,
)

evaluator = Evaluator(
    eval_config=eval_config,
    model=model,
    examples=transform_examples,
)

context.run(evaluator)

Describe the problem

Running the same evaluation locally with Beam's DirectRunner produces no error, but whenever I run it on Dataflow and Dataflow spawns more than one worker, I get the following error:

output.with_value(self.phased_combine_fn.apply(output.value)):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 882, in merge_only
    return self.combine_fn.merge_accumulators(accumulators)
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in merge_accumulators
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/home/sandbox/.pex/install/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl.06f7ceb62380d1c704d774a5096a04f953de60c9/apache_beam-2.39.0-cp37-cp37m-linux_x86_64.whl/apache_beam/transforms/combiners.py", line 665, in
    a in zip(self._combiners, zip(*accumulators_batch))
  File "/usr/local/lib/python3.7/site-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 560, in merge_accumulators
    for metric_index in range(len(self._metrics[output_name])):
TypeError: 'NoneType' object is not subscriptable

Based on the dataflow log, the failing steps were:

  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/Combine
  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PreCombineFn)/GroupByKey
  • ExtractEvaluateAndWriteResults/ExtractAndEvaluate/EvaluateMetricsAndPlots/ComputeMetricsAndPlots()/CombineMetricsPerSlice/CombinePerKey(PostCombineFn)/GroupByKey
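To illustrate the kind of failure the traceback points at, here is a minimal pure-Python sketch. It is not TFMA's actual code: ToyMetricsCombiner and its setup_first flag are hypothetical, and the idea that the real combiner's metrics are built lazily and are missing on a worker that only merges is an assumption. The traceback itself only shows that self._metrics was None inside merge_accumulators.

```python
class ToyMetricsCombiner:
    """Hypothetical stand-in for a combiner that builds its metrics lazily."""

    def __init__(self, metric_names):
        self._metric_names = list(metric_names)
        self._metrics = None  # only populated by _setup_if_needed()

    def _setup_if_needed(self):
        if self._metrics is None:
            # "" plays the role of the default output name.
            self._metrics = {"": list(self._metric_names)}

    def create_accumulator(self):
        self._setup_if_needed()
        return [0.0] * len(self._metrics[""])

    def add_input(self, accumulator, value):
        self._setup_if_needed()
        return [total + value for total in accumulator]

    def merge_accumulators(self, accumulators, setup_first=False):
        # Without the guard, an instance that never saw an input still has
        # self._metrics set to None, and the subscript below raises TypeError.
        if setup_first:
            self._setup_if_needed()
        merged = []
        for metric_index in range(len(self._metrics[""])):
            merged.append(sum(acc[metric_index] for acc in accumulators))
        return merged


# One instance does everything (analogous to a single local worker).
local = ToyMetricsCombiner(["precision", "recall"])
acc_a = local.add_input(local.create_accumulator(), 1.0)
acc_b = local.add_input(local.create_accumulator(), 2.0)
single_worker_result = local.merge_accumulators([acc_a, acc_b])  # [3.0, 3.0]

# A fresh instance that only merges (analogous to a second worker that
# receives accumulators produced elsewhere).
remote = ToyMetricsCombiner(["precision", "recall"])
try:
    remote.merge_accumulators([acc_a, acc_b])
    remote_error = None
except TypeError as exc:
    remote_error = str(exc)

# With the guard enabled, the same merge succeeds.
guarded_result = remote.merge_accumulators([acc_a, acc_b], setup_first=True)
```

If this assumption holds, it would also be consistent with the DirectRunner run succeeding, since the same process that adds inputs also performs the merge there.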

I see that you have this commit, which appears to address this problem, but it was immediately rolled back. Have you run into similar issues, and what would you recommend to fix the error?

zywind, Aug 20 '22 21:08

I tried setting Dataflow's max_num_workers to 1 and the job succeeded. It looks like the problem does indeed come from running Dataflow with multiple workers.
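For reference, a rough sketch of how that single-worker workaround could be passed as Beam pipeline flags. The dataflow_args helper and the project, region, and bucket values are hypothetical placeholders, not taken from my actual setup.

```python
def dataflow_args(project, region, temp_location, max_num_workers):
    """Build Beam pipeline flags for a Dataflow run (hypothetical helper)."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Caps autoscaling; 1 keeps the whole job on a single worker.
        f"--max_num_workers={max_num_workers}",
    ]


# Placeholder values for illustration only.
single_worker_args = dataflow_args(
    "my-project", "us-central1", "gs://my-bucket/tmp", 1)

# In a TFX pipeline these could then be attached to the component, e.g.:
# evaluator = Evaluator(...).with_beam_pipeline_args(single_worker_args)
```

This only avoids the error by giving up parallelism, so it is a workaround rather than a fix.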

zywind, Aug 20 '22 23:08

Hi @zywind ,

As mentioned here, for distributed evaluation we use tfma.ExtractEvaluateAndWriteResults. Please refer to this example notebook and let me know if this resolves your issue.

Thank you.

singhniraj08, Aug 22 '22 13:08

Hi @singhniraj08,

I'm using the official TFX Evaluator, which internally uses tfma.ExtractEvaluateAndWriteResults as you can see here.

zywind, Aug 22 '22 14:08