Transform and BulkInferrer dynamic batch size grows too large, causing OOM on 16GB GPU
System information
- Have I specified the code to reproduce the issue (Yes, No): No
- Environment in which the code is executed: Dataflow on Google Cloud. n1-highmem-8, Nvidia T4 or P100 (both give the same error).
- TensorFlow version: 2.11.0
- TFX Version: 1.12.0
- Python version: 3.7
- Python dependencies (Dockerfile submitted to TFX):
FROM tensorflow/tfx:1.12.0
RUN pip3 install --upgrade --no-cache-dir pip \
tensorflow-text==2.11.0 \
tensorflow-recommenders==0.7.2 \
scann==1.2.9
Describe the current behavior
I am using the TFX BulkInferrer to apply a model with an Xception and BERT transform layer to a dataset of 2.5 million Examples with image and text features. After running on Dataflow for 7 hours, an OOM error is triggered.
ResourceExhaustedError: Graph execution error: OOM when allocating tensor with shape[512,128,167,167] and type float on /job:localhost/replica:0/task:0/device:GPU:0
by allocator GPU_0_bfc [[{{node xception/block2_sepconv1/separable_conv2d}}]]
…
OOM when allocating tensor with shape[448,128,167,167] and
type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
We see the error happens on the GPU (device:GPU:0) in the Xception model (node xception/block2_sepconv1/separable_conv2d) when trying to process large batches (shape[512,...] and shape[448,...]).
512 * 128 * 167 * 167 = 1,827,733,504
That is a tensor with roughly 1.8 billion floating point values; at 32-bit precision (4 bytes each) the allocation is about 1.8e9 * 4 bytes ≈ 7.3 GB. A single allocation attempt of that size can fail on a GPU with 16 GB of memory.
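As a quick sanity check of that arithmetic (a minimal, generic Python sketch, nothing TFX-specific):

# Size of a float32 tensor with the shape reported in the OOM message.
batch, channels, height, width = 512, 128, 167, 167
num_elements = batch * channels * height * width   # 1,827,733,504 values
bytes_needed = num_elements * 4                     # float32 = 4 bytes each
print(f"{bytes_needed / 1e9:.1f} GB")               # ~7.3 GB for a single tensor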
Describe the expected behavior
The Beam BatchElements algorithm should constrain the dynamic batch size to values smaller than 512 or 448 so that batches fit into the 16 GB of GPU RAM. The OOM happens on the "train" split (80% of the data) after hours of processing. On the smaller "eval" split (10%) the BulkInferrer succeeds. From the Dataflow metrics, batchsize_MAX was 256.
Standalone code to reproduce the issue
The issue is data dependent. It is a basic BulkInferrer with imported examples and an imported model. Relevant Beam args:
"--runner=DataflowRunner",
"--disk_size_gb=50",
"--machine_type=n1-highmem-8",
"--experiments=use_runner_v2",
"--experiments=worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver",
# "--experiments=worker_accelerator=type:nvidia-tesla-p100;count:1;install-nvidia-driver",
"--experiments=no_use_multiple_sdk_containers",
Other info / logs
Here are the logs.
Using a bottom-up search of all the Python virtual env source files, I searched for the function names in the failed step name highlighted in the Dataflow job graph, RunInference[train]/RunInference/RunInferenceImpl/BulkInference/BatchElements/ParDo(_GlobalWindowsBatchingDoFn):
- _GlobalWindowsBatchingDoFn is only called here in BatchElements
- From the BatchElements docs, it has a parameter max_batch_size
- BatchElements is called here in ml/inference/base.py::RunInference
- Directly above is an interesting TODO to add a batch_size back-off, with a link to an open GitHub issue. It mentions “Add batch_size back off in the case there are functional reasons large batch sizes cannot be handled.” This looks like my problem too.
- bulk_inferrer/executor.py calls tfx_bsl.public.beam RunInference, which delegates to RunInferenceImpl
- This calls the previously identified base.RunInference, but adds ‘BulkInference’ as a text description
- It also passes in a ModelHandler, which is responsible for providing the BatchElements kwargs
- Since we configure the bulk_inferrer for Prediction, it will create a model_handler for in_process_inference using _get_saved_model_handler(), which should select PREDICTION from our inference spec
- A _PredictModelHandler will be created. Neither it nor its two TFX base classes (_BaseSavedModelHandler, _BaseModelHandler) overrides the base.ModelHandler.batch_elements_kwargs(), so the default empty dictionary will be provided.
- This means the default max batch size of 10000 will be used in combination with whatever adaptive batch-size logic beam.BatchElements() uses (see the sketch below).
- This adaptive logic presumably has a bug which can cause the batch size to grow too large, causing the OOM. The previously mentioned open GitHub issue confirms this suspicion.
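For illustration, here is a minimal sketch of how beam.BatchElements can be bounded; the values are assumptions for the sketch, not what the BulkInferrer currently passes (today it passes no kwargs at all):

import apache_beam as beam

# Sketch only: BatchElements grows batches adaptively between the two bounds,
# so a cap such as max_batch_size=256 would rule out the 512-element batches
# seen in the OOM above.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(list(range(10000)))
        | beam.BatchElements(min_batch_size=1, max_batch_size=256)
        | beam.Map(len)  # each element is now a list of at most 256 items
    )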
A possible solution might be to expose the max-batch-size setting on the BulkInferrer Inference spec proto and pass it all the way through. If I had a way of fixing the max batch size to 256, it should work.
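One hypothetical way such a setting could be honoured (a sketch only, assuming the Beam ModelHandler interface described above; CappedBatchModelHandler and the wrapped handler are made-up names, not part of TFX or tfx_bsl):

from apache_beam.ml.inference import base


class CappedBatchModelHandler(base.ModelHandler):
    """Hypothetical wrapper that caps the batch size used by RunInference."""

    def __init__(self, wrapped: base.ModelHandler, max_batch_size: int = 256):
        self._wrapped = wrapped
        self._max_batch_size = max_batch_size

    def load_model(self):
        return self._wrapped.load_model()

    def run_inference(self, batch, model, inference_args=None):
        return self._wrapped.run_inference(batch, model, inference_args)

    def batch_elements_kwargs(self):
        # Instead of the default empty dict, bound the batch size that
        # RunInference passes to beam.BatchElements.
        return {"max_batch_size": self._max_batch_size}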
@IzakMaraisTAL, This issue looks like a feature request. Thank you for bringing this up!
@lego0901, Please have a look into this feature request to expose max-batch-size setting on the BulkInferrer Inference spec proto. Thanks!
Ack. Thanks for your request and for the very thorough investigation!
I am repeatedly getting this same bug with the Transform component too.
Node: 'xception/block2_sepconv1/separable_conv2d' OOM when allocating tensor with shape[512,128,167,167] and type float. Full logs: downloaded-logs-20230413-082707.json.zip
This is using the same dataset (2.5M images) and pre-processing layer (creating embeddings by passing the images through Xception).
Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem. But this workaround is very manual and makes further processing more difficult.
I don't agree with this issue being classified as a feature. TFX is a scalable stream processing framework for ML. If it fails due to an increase in dataset size and incorrect usage of (or an underlying bug in) Beam, that is still a bug.
The configurable maximum batch size bug-fix suggested above will need to be exposed to the Transform component too. An alternative fix would be for Beam itself to take the available GPU memory into account when determining how much to increase its batch size.
Splitting the dataset into 12 (180k examples each) and running a separate Transform for each resulted in 11 of the 12 Transforms passing and one failing with a similar OOM problem.
After trying various workarounds like this (none of which worked 100%), I am running on CPU instead of GPU as the only reliable option. This increases the cost of the Transform from $100 to $300.
My bad. I will bump this issue up on our side and try to figure out a solution. Sorry for the inconvenience.
In tfx 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~ 100MB. It can be enabled with tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?
That sounds promising, thank you.
I would very much like to upgrade, but unfortunately I am blocked by https://github.com/tensorflow/recommenders/issues/671. Once that is resolved, I will give feedback.
Depending on how exactly you use transform and BulkInferrer you may also be able to set data (tfxio) source batch size. Or, if you use the instance dict format with transform, then you can also set it through transform context.
Thanks for the tips. I would like to apply them to the Transform component.
Depending on how exactly you use transform and BulkInferrer you may also be able to set data (tfxio) source batch size.
I instantiate a TFX Transform component as described in its documentation and provide it in the list of components to the pipeline class. The input to the Transform component is a channel of serialized examples. I'm not sure how one would leverage tfxio there.
Or, if you use the instance dict format with transform, then you can also set it through transform context.
The TFX Transform component constructor does not expose the transform context. I can see the desired_batch_size is set inside a context here inside the TransformProcessor, which is instantiated from the Executor::Do() here. Neither the TransformProcessor nor the Executor looks customisable. The value for the desired_batch_size will be None (dynamic batch size).
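For reference, when using the tf.Transform Beam API directly (rather than through the TFX Transform component), the batch size can be pinned on the context; a minimal sketch, with the temp dir and the value 256 assumed for illustration:

import tempfile

import apache_beam as beam
import tensorflow_transform.beam as tft_beam

# Sketch only: outside of the TFX component, desired_batch_size can be fixed
# on the Context that wraps the tf.Transform PTransforms.
with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp(), desired_batch_size=256):
        ...  # e.g. (dataset, metadata) | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)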
Yes, you're right, the component itself does not expose the parameter. Even if we were to add it, it would only be available in an even later TFX version than the byte-size-based batching. So, unfortunately, updating and using the flag seems like the only option.
You could try creating a custom component based on Transform that overrides the parameter, but that may be pretty involved.
Thanks for the confirmation, will let you know once we test 1.13 after a compatible ScaNN release has been made.
A new release of ScaNN is available but it looks like they skipped tensorflow 2.12 altogether and went from 2.11 to 2.13. I will wait for a future release of TFX that depends on tensorflow 2.13 to test the changes.
~~While trying to test this in 1.14.0, I got blocked by https://github.com/tensorflow/tfx/issues/6335.~~ Resolved.
I have upgraded to TFX 1.14.0. This will be tested in the next scheduled run of the pipeline at the start of November.
The upgrade to TFX 1.14.0 was held back by https://github.com/tensorflow/tfx/issues/6386. I am now applying the workaround mentioned there and should then have results after the next scheduled run at the start of Feb.
Unfortunately the fix does not work. The Transform component running TFX 1.14.0 ran out of memory on the 16GB GPU in exactly the same way as described previously.
In tfx 1.13 we introduced a new batching mode that tries to deserialize data in batches of ~ 100MB. It can be enabled with tfxio_use_byte_size_batching flag. Could you try updating to 1.13 and setting the flag to True?
I see now that for the failed TFX 1.14.0 run, I did not set the new flag as requested above. I will investigate how to set global absl flags and re-try.
I added the flag in the Transform component's Beam pipeline args:
tfx.components.Transform(<args>).with_beam_pipeline_args([<other args>, "--tfxio_use_byte_size_batching"])
In a test using the local TFX runner I could confirm that the flag value of True is propagated to my preprocessing_fn() by adding:
from absl import flags
print("tfxio_use_byte_size_batching value", flags.FLAGS.get_flag_value("tfxio_use_byte_size_batching", False))
Is this correct, or is there a better way to set this flag?
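(For comparison, a hedged sketch of setting the flag programmatically at module level in the file that defines preprocessing_fn, assuming tfx_bsl has already registered the flag by the time that module is imported on the worker:)

from absl import flags

# Sketch only: change the flag's default instead of relying on pipeline args.
if "tfxio_use_byte_size_batching" in flags.FLAGS:
    flags.FLAGS.set_default("tfxio_use_byte_size_batching", True)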
When running the TFX pipeline with the full dataset on Vertex AI, delegating the Transform to Dataflow on a 16 GB GPU, I no longer get the error message described above, but the Dataflow job still fails after a series of resource allocation errors that try to allocate > 20 GB. Here is the first one I could find:
Error processing instruction process_bundle-8858068838784883165-1941. Original traceback is
Traceback (most recent call last):
File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/beam/impl.py\", line 358, in _handle_batch
result = self._graph_state.callable_get_outputs(feed_dict)
File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 377, in apply_transform_model
return self._apply_v2_transform_model_finalized(logical_input_map)
File \"/usr/local/lib/python3.8/dist-packages/tensorflow_transform/saved/saved_transform_io_v2.py\", line 301, in _apply_v2_transform_model_finalized
return self._wrapped_function_finalized(modified_inputs)
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1184, in __call__
return self._call_impl(args, kwargs)
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1193, in _call_impl
return self._call_with_structured_signature(args, kwargs)
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1270, in _call_with_structured_signature
return self._call_flat(
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py\", line 1349, in _call_flat
return self._build_call_outputs(self._inference_function(*args))
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py\", line 196, in __call__
outputs = self._bound_context.call_function(
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py\", line 1457, in call_function
outputs = execute.execute(
File \"/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py\", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:
Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
Node: 'StatefulPartitionedCall'
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 30909005824 bytes.
\t [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
\t [[StatefulPartitionedCall/map/while/body/_576/map/while/Shape_1/_199]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: Out of memory while trying to allocate 30909005824 bytes.
\t [[{{node StatefulPartitionedCall}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_wrapped_finalized_42120]
As mentioned above, the Transform component sometimes passes if the example count is reduced, so I suspect the problem is still tied to dynamic batch size growth in some way.
Here is the log downloaded-logs-20240213-071807.json.zip.
In case it is useful, here is the source code for my preprocessing_fn(). It extracts image embeddings using Xception and text embeddings using sentence-tf-base.
What do you suggest @lego0901 and @iindyk ?
@lego0901 @iindyk any thoughts or insights on this? Just now, a TFX 1.12 pipeline started failing for the exact same reason, even though it worked before. Any indication it will be fixed in TFX 1.15?
Since the OOM happens when applying the model, and setting the tfxio_use_byte_size_batching value did not help, it could be the case that the input batch is small enough (batching happens on input batches) but the transformation in the preprocessing_fn makes it too large (this case is not easy to detect in Transform, since we need to apply the transformation to know the output size). A hacky way to deal with this in your case could be to add, at the module level of the file with preprocessing_fn:
import tensorflow_transform.beam as tft_beam
tft_beam.Context.get_desired_batch_size = lambda _: 100
It's ugly, but it should help until we have a better solution, if the problem is indeed in the produced batch size.
Interesting, will try that out and report back.
The above suggestion did not work.
I see we also set tf.config.experimental.set_memory_growth(device, True). Could that have interfered with this suggested fix (or the previous use_byte_size_batching fix)?
Applied to the Transform component's preprocessing_fn:
import tensorflow as tf
import tensorflow_transform.beam as tft_beam


def preprocessing_fn(inputs):
    # Monkey-patch the desired batch size, as suggested above.
    tft_beam.Context.get_desired_batch_size = lambda _: 100
    # Allow GPU memory allocation to grow instead of grabbing it all upfront.
    gpu_devices = tf.config.experimental.list_physical_devices("GPU")
    for device in gpu_devices:
        try:
            tf.config.experimental.set_memory_growth(device, True)
        except Exception as e:
            print(f'Ignoring: \n"{e}" \nCannot set memory growth.')
    ...
From the Dataflow worker logs:
2024-05-06 09:00:08.084986: W tensorflow/core/framework/op_kernel.cc:1828] OP_REQUIRES failed at conv_ops_impl.h:370 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[532,128,147,147] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
UPDATE: removing tf.config.experimental.set_memory_growth and retrying both the above and the previous fix still resulted in OOM on GPU after Dataflow had been running for about 1 hour. The specific message is slightly different though: downloaded-logs-20240507-141323.json.zip.