Pipeline to iterate over a single NumPy file
Describe the question.
I'm trying to use GPU direct storage (GDS) via DALI's numpy reader for a dataset of many (10^4) 3D volumes (each volume is one training sample). However, the API seems to require that one file only contains one sample, so each sample will have to be in a different file, leading to tens of thousands of files. Opening this many files each training epoch could have significant overhead for certain file systems. Is there a way to use larger files instead (for example stacking volumes into chunks) and iterate over a dimension? #4140 suggests using an external source for this, but that would not support GDS.
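For context, this is roughly the pipeline I have now, with one `.npy` file per volume (just a sketch; the `file_root` path and batch settings are placeholders):

```python
# Roughly what I have now: one .npy file per 3D volume, read with the GPU
# (GDS) numpy reader. The file_root path below is a placeholder.
from nvidia.dali import fn, pipeline_def

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def volume_pipeline():
    return fn.readers.numpy(
        device="gpu",               # GPUDirect Storage code path
        file_root="/data/volumes",  # ~10^4 single-sample .npy files
        random_shuffle=True,
    )
```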
Check for duplicates
- [x] I have searched the open bugs/issues and have found no duplicates for this bug report
Hi @ziw-liu,
Thank you for reaching out.
> However, the API seems to require that one file only contains one sample, so each sample will have to be in a different file, leading to tens of thousands of files. Opening this many files each training epoch could have significant overhead for certain file systems
I agree that opening many files is inefficient from the file system's point of view, and also for GDS, which works best with large files rather than small ones, where the cost of the GDS initialization may outweigh the benefits. There is an option to use the `roi`-related arguments to read only a slice of the file. However, for the GDS variant of the numpy reader, the file is read as a whole and only then is a part of it sliced out on the GPU (so we are wasting part of the IO). For some configurations it may (theoretically) be possible to let GDS read only the chunk of data that corresponds to a particular sample, but depending on the slicing pattern this may or may not be efficient.
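For reference, a minimal sketch of how the ROI arguments look on the reader (the `file_root` path and ROI extents are placeholders):

```python
# Minimal sketch of the roi_* arguments of the numpy reader. With
# device="gpu" (the GDS path) the whole file is still read before the ROI
# is cut out on the GPU. The file_root path is a placeholder.
from nvidia.dali import fn, pipeline_def

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def roi_pipeline():
    data = fn.readers.numpy(
        device="gpu",
        file_root="/data/volumes",  # placeholder directory of .npy files
        roi_start=(0, 0, 0),        # start of the slice in each volume
        roi_end=(64, 64, 64),       # end (exclusive) of the slice
    )
    return data
```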
Can you tell us more about your use case? Do you see an IO bottleneck when using plain IO without GDS? Are you saturating the storage IO, or is the CPU busy enough to prevent this?
Hi @JanuszL and thanks for the quick answer!
> There is an option to use the `roi`-related arguments to read only a slice of the file. However, for the GDS variant of the numpy reader, the file is read as a whole and only then is a part of it sliced out on the GPU (so we are wasting part of the IO).
I was trying that, and because my files were larger than VRAM (<1% chunks of a 10^1 TB dataset), it would OOM before getting to the slicing step.
> Can you tell us more about your use case? Do you see an IO bottleneck when using plain IO without GDS? Are you saturating the storage IO, or is the CPU busy enough to prevent this?
I was just starting to explore DALI. I used to have an I/O bottleneck when reading with Python code from data stored on NFS (VAST), and had to pre-cache on the compute nodes (DGX H100/H200), which imposes a size limit. Reading this article, I thought GDS would be a good way to avoid that step, but NFS suffers a lot from the metadata overhead of opening many files. If I use an external source with DALI, I won't be able to use DALI's thread pool and will have to use multiprocessing, which should then be similar to running it in a multi-worker PyTorch dataloader?
Hi @ziw-liu,
Thank you for providing the details of your use case.
> I used to have an I/O bottleneck when reading with Python code from data stored on NFS (VAST), and had to pre-cache on the compute nodes (DGX H100/H200)
I would first confirm that GDS is the solution. You can check whether CPU utilization is high and is the limiting factor, or whether the storage simply cannot feed the data faster no matter what. Maybe you can try https://github.com/rapidsai/kvikio for the initial evaluation?
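For example (just a sketch, with a placeholder path and size), something along these lines would tell you the raw read bandwidth GDS can get from your storage:

```python
# Quick-and-dirty bandwidth check with kvikio: read part of one large file
# straight into GPU memory and time it. The path and size are placeholders.
import time

import cupy as cp
import kvikio

path = "/data/chunks/chunk_000.bin"  # placeholder: any large file on the NFS mount
nbytes = 4 * 1024**3                 # read 4 GiB as a test

buf = cp.empty(nbytes, dtype=cp.uint8)
start = time.perf_counter()
with kvikio.CuFile(path, "r") as f:
    nread = f.read(buf)              # blocking GDS read into the GPU buffer
elapsed = time.perf_counter() - start
print(f"read {nread / 1e9:.1f} GB in {elapsed:.2f} s "
      f"({nread / elapsed / 1e9:.2f} GB/s)")
```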
Thanks, for now I can still afford to pre-cache. Another major reason to try DALI is that we run some computation-heavy augmentations, which create compute contention on the CPU.
> Maybe you can try https://github.com/rapidsai/kvikio for the initial evaluation?
I was also looking at kvikio. I guess if I use GDS via kvikio, I would use an external source with the CuPy interface to feed the data into a DALI pipeline?
> I was also looking at kvikio. I guess if I use GDS via kvikio, I would use an external source with the CuPy interface to feed the data into a DALI pipeline?
I think that should work. Please give it a go and let us know how that works for you.
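A rough sketch (not tested) of how the pieces could fit together. Assumptions: volumes are stacked in raw binary chunk files (no npy header), all volumes share one shape and dtype, and `chunk_paths`, `VOLUMES_PER_CHUNK`, and `VOLUME_SHAPE` are placeholders:

```python
# Sketch: kvikio GDS reads into CuPy arrays, fed to DALI via an external
# source. Paths, chunk layout, and shapes below are placeholder assumptions.
import math

import cupy as cp
import kvikio
from nvidia.dali import fn, pipeline_def

chunk_paths = ["/data/chunks/chunk_000.bin",   # placeholder chunk files
               "/data/chunks/chunk_001.bin"]
VOLUMES_PER_CHUNK = 64                         # assumed volumes per chunk
VOLUME_SHAPE = (128, 256, 256)                 # assumed per-volume shape
DTYPE = cp.float32
VOLUME_BYTES = math.prod(VOLUME_SHAPE) * cp.dtype(DTYPE).itemsize


def volume_source(sample_info):
    # Map the flat sample index to (chunk file, byte offset within the chunk).
    idx = sample_info.idx_in_epoch
    if idx >= len(chunk_paths) * VOLUMES_PER_CHUNK:
        raise StopIteration
    path = chunk_paths[idx // VOLUMES_PER_CHUNK]
    offset = (idx % VOLUMES_PER_CHUNK) * VOLUME_BYTES
    buf = cp.empty(VOLUME_SHAPE, dtype=DTYPE)
    with kvikio.CuFile(path, "r") as f:
        # GDS read directly into GPU memory, skipping the CPU bounce buffer.
        f.read(buf, file_offset=offset)
    return buf


@pipeline_def(batch_size=4, num_threads=4, device_id=0)
def chunked_pipeline():
    # batch=False: the callable returns one sample per call.
    return fn.external_source(source=volume_source, device="gpu", batch=False)


pipe = chunked_pipeline()
pipe.build()
volumes, = pipe.run()
```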