
TF Dataset performance regression / best practices for data augmentation on the accelerator

Open jonas-eschmann opened this issue 5 years ago • 6 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Win 10
  • TensorFlow version (use command below): 2.3.0
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: P100 (colab)

Describe the current behaviour
Iterating over a simple Dataset is very slow (even when using a @tf.function). It is 4x (CPU) to 6x (GPU) slower than a comparable iteration using numpy.

Describe the expected behaviour
I would expect much faster execution of such simple code, especially because of the optimisations that could be done on the computational graph. It should at least be on par with iterating in Python over numpy data / operations. I can imagine some share of the GPU overhead is attributable to the CUDA kernel launch overhead. I am working on a much more complex dataset structure (dealing with random sequences + a batch of random additional data per sequence) and was surprised that a big part of its slowness can be attributed to the simple example presented here.

I could probably create a dense dataset (with 100x or more memory footprint, depending on sequence length) to increase the speed, but I believe this kind of data augmentation (different sequence offsets) should be done on the accelerator so as not to blow up the memory footprint unnecessarily. I did not find any best practices regarding this very common problem. I am very much following the philosophy (I would rather call it "hacking") of the TensorFlow function timeseries_dataset_from_array (a sketch of the pattern I mean follows below): https://github.com/tensorflow/tensorflow/blob/fcc4b966f1265f466e82617020af93670141b009/tensorflow/python/keras/preprocessing/timeseries.py#L30
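For illustration, a minimal sketch of the pattern I mean (the array shape, window length and dataset size here are made-up placeholders, and tf.data.experimental.AUTOTUNE is just the usual parallelism/prefetch knob): keep one dense array on the device and cut windows out of it at random offsets inside map(), instead of materialising every window in host memory.

import tensorflow as tf

# Hypothetical dense source array and window length, purely for illustration
data = tf.random.normal((100_000, 8))
seq_len = 10

def slice_window(_):
  # Draw a random start offset and slice a window out of the dense array
  start = tf.random.uniform((), 0, tf.shape(data)[0] - seq_len, dtype=tf.int32)
  return data[start:start + seq_len]

ds = (tf.data.Dataset.range(1_000_000)
        .map(slice_window, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .batch(256)
        .prefetch(tf.data.experimental.AUTOTUNE))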

Standalone code to reproduce the issue
https://colab.research.google.com/drive/1j14DvChu7FJDyD6D8aZPb7w-h4mdTEVZ?usp=sharing

import tensorflow as tf
import numpy as np

d = 1000

# tf.data pipeline: every element is a 10x10 slice of a larger constant tensor
z = tf.zeros((d, d), dtype=tf.float32)
ds = tf.data.Dataset.range(100).map(lambda x: z[:10, :10]).repeat().batch(256).take(1000)

@tf.function
def run():
  # Consume each batch with a cheap reduction so the dataset throughput is the bottleneck
  s = tf.constant(0.0)
  for x in ds:
    s += tf.reduce_sum(x)
  return s

run()

# Equivalent iteration over the same number of 10x10 slices in plain numpy
a = np.ones((d, d), dtype=np.float32)
s = 0
for _ in range(1000):
  for __ in range(256):
    s += np.sum(a[:10, :10])
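For reference, the kind of restructuring I would expect to help (just the standard tf.data knobs, nothing specific to this issue): moving the map after the batch so it runs once per 256 elements instead of once per element, and adding prefetch so the input pipeline overlaps with the consumer. A sketch, reusing z from above:

# Sketch: vectorised variant of the toy pipeline above (same z as before);
# the map now produces a whole [batch, 10, 10] tensor per call.
ds_fast = (tf.data.Dataset.range(100)
             .repeat()
             .batch(256)
             .map(lambda x: tf.tile(z[None, :10, :10], [tf.shape(x)[0], 1, 1]),
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
             .take(1000)
             .prefetch(tf.data.experimental.AUTOTUNE))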

jonas-eschmann avatar Nov 14 '20 14:11 jonas-eschmann

Was able to reproduce the issue with TF v2.3. Please find the gist of it here. Thanks!

amahendrakar avatar Nov 17 '20 08:11 amahendrakar

@jonas-eschmann What is your purpose with s += np.sum(a[:10, :10])?

bhack avatar Nov 17 '20 13:11 bhack

@bhack I wanted to make sure the pipeline is bottlenecked by the dataset throughput and not the reduce_sum.

jonas-eschmann avatar Nov 17 '20 14:11 jonas-eschmann

Do you want to apply transformations to the dataset elements?

bhack avatar Nov 17 '20 15:11 bhack

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may no longer be relevant to the current state of the code base.

The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.

Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.

Venkat6871 avatar Aug 20 '24 06:08 Venkat6871

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Aug 28 '24 01:08 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

github-actions[bot] avatar Sep 04 '24 01:09 github-actions[bot]
