
[BUG] Iterating ds.tensorflow() is 200 times slower than ds

Open daniel-falk opened this issue 3 years ago • 5 comments

🐛 Bug Report

When loading the MNIST dataset and iterating over the HubCloudDataset directly, it takes 8e-5 seconds per image. When accessing it as a TensorFlow dataset (DatasetAdapter), it takes 0.02 seconds per sample, i.e. ~200 times longer.

See this gist for example: https://gist.github.com/daniel-falk/c58eae122acf730607aeeddaf1848229

Am I doing this the wrong way? If so, the documentation should explain more clearly how it is meant to be used.

⚙️ Environment

  • Python version(s): 3.10 @ Linux, 3.7.14 @ Colab

daniel-falk avatar Sep 18 '22 14:09 daniel-falk

@daniel-falk This is because normal dataset iteration does not download or decompress the samples, but tensorflow iteration does. To get a fair comparison, try doing sample['image'].numpy() in the ds iteration.
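For reference, a fair comparison loop might look like the sketch below. This is an illustrative sketch only, assuming the `hub` package and the public `hub://activeloop/mnist-train` dataset; the import is kept inside the function so the sketch stands alone even where `hub` is not installed.

```python
def iterate_with_decompression():
    """Iterate the dataset while forcing download + decompression,
    matching the work that .tensorflow() iteration does per sample."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    for sample in ds:
        # .numpy() triggers the actual fetch and decompression,
        # unlike plain iteration, which only yields lazy references.
        sample['images'].numpy()
```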

farizrahman4u avatar Sep 18 '22 16:09 farizrahman4u

@farizrahman4u I see, that was indeed the case. When accessing the tensor data, there was no significant difference between using .tensorflow() and iterating the dataset directly.

Is it expected to be this slow, though? According to my benchmarks, a single epoch over the MNIST dataset (28x28, 60k images) would take 2 hours.

It does seem like the request overhead dominates: fetching a sample with a 484x640x3 frame takes only 15% longer than a 28x28x1 frame. I would expect there to be a faster way to iterate over batches of frames than requesting them one by one. Or must the tensors always be stored as a single object per sample in the backend storage? I also benchmarked reading MNIST frames one by one directly from an S3 bucket; that was also very slow, but reading from Hub was still 2.62 times slower than S3 (though the variance is quite large).

Benchmarks found here: https://gist.github.com/daniel-falk/ade7e7d18c3b6e1a3e697c8a0a0616ab

daniel-falk avatar Sep 18 '22 19:09 daniel-falk

@daniel-falk indeed, the current implementation of .tensorflow() is not optimized compared to .pytorch(). Furthermore, our current efforts are focused on an even more optimized .pytorch() dataloader, which you can get started with here or in the docs.
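A typical training loop over the optimized loader might look like the sketch below. This is a hedged sketch, not official guidance: it assumes the `hub` package, the public `hub://activeloop/mnist-train` dataset, and the `ds.pytorch()` keyword arguments shown (batch_size, shuffle, num_workers); the import is kept inside the function so the sketch stands alone without `hub` installed.

```python
def pytorch_streaming_sketch(batch_size=32):
    """Stream MNIST remotely through the .pytorch() dataloader."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    # .pytorch() wraps the dataset in a torch DataLoader that fetches
    # and decompresses samples in background worker processes.
    loader = ds.pytorch(batch_size=batch_size, shuffle=True, num_workers=2)
    for batch in loader:
        images = batch['images']  # batched image tensor
        # ... feed `images` to the model here ...
```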

I just made an example for you that shows MNIST iterating in under 5s with Deep Lake / PyTorch, streamed remotely: https://colab.research.google.com/drive/1_-eZJUN5pU0HrW6MZgjOtPZVNsB3q6zD?usp=sharing

The Deep Lake TensorFlow integration is on our roadmap for version 3.2.0; however, we would really welcome a contribution to release it faster.

davidbuniat avatar Sep 18 '22 19:09 davidbuniat

@daniel-falk You can get faster iteration by passing fetch_chunks=True to your .numpy() call. We are looking into how we can enable this by default for iteration.

import tqdm
import hub

ds = hub.load('hub://activeloop/mnist-train')

for sample in tqdm.tqdm(ds):
  # fetch_chunks=True downloads whole chunks at a time instead of
  # issuing one request per sample
  sample['images'].numpy(fetch_chunks=True)

The above script takes 30s to complete on Colab (compared to ~1hr without fetch_chunks=True).

Alternatively, you can do imgs = ds.images.numpy(aslist=True) for small datasets like MNIST (this takes <20s on Colab).
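For small datasets the bulk read can be sketched like this (again assuming the `hub` package and the public mnist-train dataset; the import is inside the function so the sketch stands alone without `hub` installed):

```python
def load_all_mnist_images():
    """Fetch every image in one bulk call instead of 60k per-sample requests."""
    # Imported here so this sketch is importable without `hub` installed.
    import hub

    ds = hub.load('hub://activeloop/mnist-train')
    # aslist=True returns a Python list of numpy arrays (one per sample),
    # which holds the whole dataset in memory, so it is only practical
    # for small datasets like MNIST.
    return ds.images.numpy(aslist=True)
```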

farizrahman4u avatar Sep 19 '22 01:09 farizrahman4u

Thank you @farizrahman4u and @davidbuniat, that was exactly what I was looking for! I will look further into the Deep Lake / Hub3 dataset and see whether that works for me.

daniel-falk avatar Sep 19 '22 06:09 daniel-falk

Closing as fixed by #1887

farizrahman4u avatar Oct 18 '22 07:10 farizrahman4u