
[BUG] Iterating ds.tensorflow() is 200 times slower than ds

Open daniel-falk opened this issue 3 years ago • 5 comments

🐛 Bug Report

When loading the MNIST dataset and iterating over the HubCloudDataset directly, it takes 8e-5 seconds per image. When accessing it as a TensorFlow dataset (DatasetAdapter), it takes 0.02 seconds per sample, i.e. ~200 times longer.

See this gist for example: https://gist.github.com/daniel-falk/c58eae122acf730607aeeddaf1848229

Am I doing this the wrong way? If so, the documentation should explain more clearly how it is meant to be used.

⚙️ Environment

  • Python version(s): 3.10 @ Linux, 3.7.14 @ Colab

daniel-falk avatar Sep 18 '22 14:09 daniel-falk

@daniel-falk This is because normal dataset iteration does not download or decompress the samples, but tensorflow iteration does. To get a fair comparison, try doing sample['image'].numpy() in the ds iteration.
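For reference, a fair comparison loop might look like the sketch below. This is an illustrative sketch only, assuming the `hub` package and the public `hub://activeloop/mnist-train` dataset; the import is kept inside the function so the sketch stands alone even where `hub` is not installed.

```python
def iterate_with_decompression():
    """Iterate the dataset while forcing download + decompression,
    matching the work that .tensorflow() iteration does per sample."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    for sample in ds:
        # .numpy() triggers the actual fetch and decompression,
        # unlike plain iteration, which only yields lazy references.
        sample['images'].numpy()
```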

farizrahman4u avatar Sep 18 '22 16:09 farizrahman4u

@farizrahman4u I see, that was indeed the case. When accessing the tensor data, there was no significant difference between using .tensorflow() and iterating the dataset directly.

Is it expected to be this slow, though? According to my benchmarks, a single epoch over the MNIST dataset (28x28, 60k images) would take 2 hours.

It does seem like the request overhead dominates: fetching a sample with a 484x640x3 frame takes only 15% longer than a 28x28x1 frame. I would expect there to be a faster way to iterate over batches of frames than requesting them one by one. Or must the tensors always be stored as a single object per sample in the backend storage? I also benchmarked reading MNIST frames one by one directly from an S3 bucket; that was also very slow, but reading from Hub was still 2.62 times slower than S3 (though the variance is quite large).

Benchmarks found here: https://gist.github.com/daniel-falk/ade7e7d18c3b6e1a3e697c8a0a0616ab

daniel-falk avatar Sep 18 '22 19:09 daniel-falk

@daniel-falk indeed, the current implementation of .tensorflow() is not optimized compared to .pytorch(). Furthermore, our current efforts are focused on an even more optimized .pytorch() dataloader, which you can get started with here or in the docs.
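A typical training loop over the optimized loader might look like the sketch below. This is a hedged sketch, not official guidance: it assumes the `hub` package, the public `hub://activeloop/mnist-train` dataset, and the `ds.pytorch()` keyword arguments shown (batch_size, shuffle, num_workers); the import is kept inside the function so the sketch stands alone without `hub` installed.

```python
def pytorch_streaming_sketch(batch_size=32):
    """Stream MNIST remotely through the .pytorch() dataloader."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    # .pytorch() wraps the dataset in a torch DataLoader that fetches
    # and decompresses samples in background worker processes.
    loader = ds.pytorch(batch_size=batch_size, shuffle=True, num_workers=2)
    for batch in loader:
        images = batch['images']  # batched image tensor
        # ... feed `images` to the model here ...
```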

I just made an example for you that shows MNIST iterating in under 5s with Deep Lake / PyTorch, streamed remotely: https://colab.research.google.com/drive/1_-eZJUN5pU0HrW6MZgjOtPZVNsB3q6zD?usp=sharing

The Deep Lake TensorFlow integration is on our roadmap for version 3.2.0; however, we would really welcome a contribution to release it faster.

davidbuniat avatar Sep 18 '22 19:09 davidbuniat

@daniel-falk You can get faster iteration by passing fetch_chunks=True to your .numpy() call. We are looking into how we can enable this by default for iteration.

import tqdm
import hub

ds = hub.load('hub://activeloop/mnist-train')

for sample in tqdm.tqdm(ds):
  # fetch_chunks=True downloads whole chunks at a time instead of
  # issuing one request per sample
  sample['images'].numpy(fetch_chunks=True)

The above script takes 30s to complete on Colab (compared to ~1hr without fetch_chunks=True).

Alternatively, you can do imgs = ds.images.numpy(aslist=True) for small datasets like MNIST (this takes <20s on Colab).
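For small datasets the bulk read can be sketched like this (again assuming the `hub` package and the public mnist-train dataset; the import is inside the function so the sketch stands alone without `hub` installed):

```python
def load_all_mnist_images():
    """Fetch every image in one bulk call instead of 60k per-sample requests."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    # aslist=True returns a Python list of numpy arrays (one per sample),
    # which holds the whole dataset in memory, so it is only practical
    # for small datasets like MNIST.
    return ds.images.numpy(aslist=True)
```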

farizrahman4u avatar Sep 19 '22 01:09 farizrahman4u

Thank you @farizrahman4u and @davidbuniat, that was exactly what I was looking for! I will look further into the Deep Lake / Hub3 dataset and see whether that works for me.

daniel-falk avatar Sep 19 '22 06:09 daniel-falk

Closing as fixed by #1887

farizrahman4u avatar Oct 18 '22 07:10 farizrahman4u