deeplake
[BUG] Delay at the beginning of each epoch with num_workers > 0
🐛 Bug Report
⚗️ Current Behavior
When I set num_workers > 0 with PyTorch, there is a pause of ~90 sec at the beginning of each epoch before the first batch is processed. At first I thought it was a problem with PyTorch Lightning, but after creating a minimal training loop from your docs I experience the same problem. I'm training on Google Colab with Hub 2.5.2.
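Roughly the kind of minimal loop I'm running (a sketch; the dataset path and argument values are placeholders, not my exact code):

```python
import hub

ds = hub.load("hub://activeloop/some-dataset")  # placeholder path

# shuffle=True together with num_workers > 0 triggers the delay for me
loader = ds.pytorch(num_workers=2, batch_size=32, shuffle=True)

for epoch in range(3):
    for i, batch in enumerate(loader):
        if i == 0:
            print(f"epoch {epoch}: first batch arrived")  # ~90 s pause before this line
```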
Expected behavior/code
No delay, or only a small delay, at the beginning of each epoch.
⚙️ Environment
Python version(s): 3.7
OS: Linux (Google Colab)
IDE: Jupyter
Hey @pietz, I'm looking into this, could you let me know whether shuffle=True or False in your experiments?
It's True. Let me check the behaviour without.
@AbhinavTuli Turning shuffle off does fix the problem!
I see. So the reason for this is most likely that we create a shuffle buffer and fill it up with elements before iteration starts. The default size of this is 2GB (but can be overridden) which could be why it takes 90 seconds for iteration to start for you. We just merged a PR yesterday that would make this behavior more obvious.
Thanks mate!
Could you elaborate in which direction (smaller or bigger) I should change the buffer size to accelerate it? Will accelerating it initially lead to a slowdown later in the epoch?
Do you have a few tips for dealing with high resolution datasets when only a small random crop is needed during each iteration?
I'm thinking about writing a custom collate function so that I can take N different crops from each image before fusing them into a batch. For example, instead of 32 samples leading to a batch size of 32, I could take 8 images with 4 random crops each and merge them into a batch of 32. That should accelerate training quite a bit.
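Something along these lines (a rough sketch; it assumes ds.pytorch() forwards collate_fn to the underlying DataLoader and that each sample arrives as an (image, label) pair of numpy arrays; the crop size and crops_per_image values are made up):

```python
import numpy as np
import torch

def multi_crop_collate(batch, crops_per_image=4, crop=256):
    # Take several random crops from each loaded image, so 8 full-resolution
    # images with 4 crops each become an effective batch of 32 patches.
    patches, labels = [], []
    for image, label in batch:  # image: HxWxC numpy array
        h, w = image.shape[:2]
        for _ in range(crops_per_image):
            top = np.random.randint(0, max(1, h - crop))
            left = np.random.randint(0, max(1, w - crop))
            patches.append(torch.from_numpy(
                image[top:top + crop, left:left + crop].copy()))
            labels.append(torch.as_tensor(label))
    return torch.stack(patches), torch.stack(labels)

# loader = ds.pytorch(batch_size=8, collate_fn=multi_crop_collate, num_workers=2, shuffle=True)
```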
Hi @pietz, sorry for the late follow-up, please see the answers below.
> Could you elaborate in which direction (smaller or bigger) I should change the buffer size to accelerate it? Will accelerating it initially lead to a slowdown later in the epoch?
Here is documentation explaining how shuffling works: https://docs.activeloop.ai/how-hub-works/shuffling-in-ds.pytorch. If you want to kickstart training faster, then making the shuffle buffer smaller would help; however, it would significantly reduce the randomness of the ordering, which might impact accuracy/generalization. We are working on a set of new APIs, as discussed in https://github.com/activeloopai/Hub/issues/1700, to support custom ordering, but that is still in the works.
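For example, a minimal sketch of passing a smaller buffer (the buffer_size argument is in MB; please double-check the parameter name against your Hub version, and the other values here are just illustrative):

```python
import hub

ds = hub.load("hub://activeloop/some-dataset")  # placeholder path

# Default buffer_size is 2048 MB (2 GB). A smaller buffer starts iteration
# sooner, at the cost of less thorough shuffling within each epoch.
loader = ds.pytorch(
    num_workers=2,
    batch_size=32,
    shuffle=True,
    buffer_size=256,
)
```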
> Do you have a few tips for dealing with high resolution datasets when only a small random crop is needed during each iteration?
When you store high-resolution images inside a tensor, let's say aerial images, we typically do tiling behind the scenes to store the data efficiently. So if you can restrict the random crop to fall inside a single tile, access could be up to 4x faster. How large are the images you are working with?
> I'm thinking about writing a custom collate function so that I can take N different crops from each image before fusing them into a batch. For example, instead of 32 samples leading to a batch size of 32, I could take 8 images with 4 random crops each and merge them into a batch of 32. That should accelerate training quite a bit.
Typically, I would use a transform (https://docs.activeloop.ai/getting-started/parallel-computing) to generate a new dataset, and only then use the cropped images to train the model. In this case, the crops will already be stored in the newly created dataset. Would this work for your use case?
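A minimal sketch of that approach, following the @hub.compute / .eval() pattern from the parallel-computing docs (tensor names like images/labels and the crop size are placeholders for your schema):

```python
import hub
import numpy as np

@hub.compute
def random_crop(sample_in, sample_out, crop=256):
    # Append one random crop per source image to the new dataset
    image = sample_in.images.numpy()
    h, w = image.shape[:2]
    top = np.random.randint(0, max(1, h - crop))
    left = np.random.randint(0, max(1, w - crop))
    sample_out.images.append(image[top:top + crop, left:left + crop])
    sample_out.labels.append(sample_in.labels.numpy())
    return sample_out

# ds_in: source dataset; ds_out: empty dataset with matching tensors (images, labels)
random_crop(crop=256).eval(ds_in, ds_out, num_workers=2)
```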
Thanks for your answer. The buffer size seems to be what I was looking for. Trading "less" randomness for faster loading is exactly what I want.
The tiling also sounds very interesting, but I don't understand what I need to do on my end to get it going. I'm using a transform function that also includes the random crop operation. Would this be enough for the dataset object to magically "know" which pixel information to pull? The dataset has varying resolutions; the max is 36MP.
Tiling or cropping the images before creating the dataset is not something I really want to do. I might do it at some point, but I consider it a heavy workaround. Programmatically speaking, it should be something that's optimized behind the scenes so the dataset stays true to its original shape.
Thank you @davidbuniat for your help!
Hey @pietz. Tiling does automatically figure out which tiles to pull, but we currently don't have a good way to expose that in our PyTorch integration; we do plan on supporting it in the future. As a workaround, you can write a custom dataloader that accesses a slice of each image using something like ds.images[i, 0:200, 0:200].numpy(), which will only read the relevant tiles of the image.
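A rough sketch of that workaround (it assumes .shape on a tensor sample is available without reading the full image, and that your tensors are named images/labels; adapt to your schema):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class RandomCropDataset(Dataset):
    # Reads only a random crop from each sample, so (with tiling) only the
    # tiles overlapping the crop should be fetched from storage.
    def __init__(self, ds, crop=200):
        self.ds = ds
        self.crop = crop

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, i):
        h, w = self.ds.images[i].shape[:2]  # shape metadata, not pixel data
        top = np.random.randint(0, max(1, h - self.crop))
        left = np.random.randint(0, max(1, w - self.crop))
        # Slicing before .numpy() limits the read to the relevant tiles
        patch = self.ds.images[i, top:top + self.crop, left:left + self.crop].numpy()
        label = self.ds.labels[i].numpy()
        return torch.from_numpy(patch), torch.as_tensor(label)

# loader = DataLoader(RandomCropDataset(ds), batch_size=32, num_workers=2, shuffle=True)
```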