Subhash Ramesh
same here
Hi @tmbdev, thanks for the reply! The EKS nodes themselves (p4d.24xlarge) have ~1 TB+ of memory, and each pod doesn't have any CPU/memory limits since each pod is assigned...
@tmbdev When using WebDataset to train in a distributed setup with a large dataset that is streamed from cloud storage, what would be the recommended way to set this up,...
Hi @tmbdev, so in my WebDataset setup, I've been using `repeat=True` in the `WebDataset` constructor (in addition to Lightning's `limit_{train/val/test}_batches`), and just creating a regular PyTorch `DataLoader` from this...
@tmbdev So I tried disabling the repeat and just set the number of steps to be lower than the exact number of steps needed to go through 1 epoch, and...
After multiple tests, I've found that it is the caching that is most likely causing the issue. If I disable caching, then the number of processes stays within the expected...
I think I've now fixed the bug. I think what was happening was that, even when caching is enabled, WebDataset will still first `gopen` the input URLs, and then check...
Also, in line 69 of the above snippet, it appears that it uses `self.tempsuffix`, but I don't see this attribute defined in the class. Is this a typo?
I'm also getting this error with the 2.0 rc8 version. Any updates on how to fix this?
Hi @igungor, thanks for your reply. The s5cmd version was 1.2.1, and the output of `ulimit -a` when run on the actual host node of the Docker container is:...