[BUG] Iterating ds.tensorflow() is 200 times slower than ds
🐛🐛 Bug Report
When loading the MNIST dataset and iterating over the HubCloudDataset directly, it takes about 8e-5 seconds per image. When accessing it as a TensorFlow dataset (DatasetAdapter), it takes about 0.02 seconds per sample, i.e. ~200 times longer.
See this gist for example: https://gist.github.com/daniel-falk/c58eae122acf730607aeeddaf1848229
Am I doing it the wrong way? If so, the documentation should explain more clearly how it is meant to be used.
⚙️ Environment
Python version(s): 3.10 @ Linux, 3.7.14 @ Colab
@daniel-falk This is because normal dataset iteration does not download or decompress the samples, but tensorflow iteration does. To get a fair comparison, try doing sample['images'].numpy() in the ds iteration.
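For illustration, a minimal sketch of such a comparison (this is not the exact benchmark from the linked gist; timings vary with network and machine, and it assumes the images tensor name of the hosted mnist-train dataset):

import time
import tqdm
import hub

ds = hub.load('hub://activeloop/mnist-train')

# Lazy iteration: no sample data is downloaded or decompressed.
start = time.time()
for sample in tqdm.tqdm(ds):
    pass
print(f"lazy: {(time.time() - start) / len(ds):.1e} s/sample")

# Fair comparison: materialize each image, which is what the
# TensorFlow adapter does on every step.
start = time.time()
for sample in tqdm.tqdm(ds):
    img = sample['images'].numpy()
print(f"eager: {(time.time() - start) / len(ds):.1e} s/sample")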
@farizrahman4u I see, that was indeed the case. When accessing the tensor there was no significant difference between using .tensorflow() or not.
Is it expected to be this slow, though? According to my benchmarks it would take 2 hours to iterate over the data for a single epoch on the MNIST dataset (28x28, 60k images).
It does seem like the per-request overhead takes most of the time: fetching a sample with a 484x640x3 frame takes only about 15% longer than a 28x28x1 frame. I would expect there to be some faster way to iterate over batches of frames than requesting them one by one? Or do the tensors perhaps have to be stored as a single object per tensor in the backend storage? I also benchmarked reading MNIST frames one by one directly from an S3 bucket; that was also very slow, but reading from Hub was still 2.62 times slower than S3 (though the variation is quite large).
Benchmarks found here: https://gist.github.com/daniel-falk/ade7e7d18c3b6e1a3e697c8a0a0616ab
@daniel-falk indeed, the current implementation of .tensorflow() is not optimized compared to .pytorch(). Furthermore, our current efforts are focused on making the .pytorch() dataloader even more optimized; you can get started with it here or check the docs.
Just made an example here for you that shows under-5s iteration of MNIST on Deep Lake / PyTorch, streamed remotely: https://colab.research.google.com/drive/1_-eZJUN5pU0HrW6MZgjOtPZVNsB3q6zD?usp=sharing
Deep Lake TensorFlow integration is on our roadmap for version 3.2.0; however, we would really welcome a contribution to release it sooner.
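For reference, a minimal sketch of the kind of streamed PyTorch iteration shown in that notebook (the exact arguments and batch size are assumptions, as are the images/labels tensor names; see the notebook and docs for the current API):

import hub
from tqdm import tqdm

ds = hub.load('hub://activeloop/mnist-train')

# ds.pytorch() wraps the dataset in a torch DataLoader that fetches and
# decompresses data in worker processes instead of one sample at a time.
dataloader = ds.pytorch(num_workers=2, batch_size=256, shuffle=False)

for batch in tqdm(dataloader):
    images, labels = batch['images'], batch['labels']  # torch tensors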
@daniel-falk You can get faster iteration by passing fetch_chunks=True to your .numpy() call. We are looking into how we can enable this by default for iteration.
import tqdm
import hub
ds = hub.load('hub://activeloop/mnist-train')
for sample in tqdm.tqdm(ds):
    sample['images'].numpy(fetch_chunks=True)
The above script takes ~30s to complete on Colab (compared to ~1 hr without fetch_chunks=True).
Alternatively, you can also do imgs = ds.images.numpy(aslist=True) for small datasets like MNIST (this takes <20s on Colab).
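For example, a small sketch of that bulk fetch (aslist=True returns a Python list of per-sample numpy arrays; the timing print is just for illustration):

import time
import hub

ds = hub.load('hub://activeloop/mnist-train')

start = time.time()
imgs = ds.images.numpy(aslist=True)  # list of 60k small numpy arrays
print(f"fetched {len(imgs)} images in {time.time() - start:.1f} s")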
Thank you @farizrahman4u and @davidbuniat, that was exactly what I was looking for! I will look more into the Deep Lake / Hub3 dataset format and see if that works for me.
Closing as fixed by #1887