Investigate keeping the content of the downloaded chunks in RAM instead of writing it to file.
🚀 Feature
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Let's keep this open, as we've been experimenting around this issue. We'll continue exploring and will add our findings here from the last few experiments.
Idea: One of the other ideas that could be explored is this: we create a multiprocessing dictionary and share it between the workers, or keep a simple dict within the worker process. The downloader workers/threads would add data to the buffer corresponding to a chunk key in the shared dictionary, and the reader can then check for the existence of that key (along with certain byte ranges or the full size of the chunk) before starting to read.
Some potential downsides of this approach might include:
- Performance bottlenecks due to the overhead of `multiprocessing.Manager().dict()`, especially under heavy concurrent access.
- Synchronization complexity, as ensuring thread/process safety for concurrent reads and writes to the buffer may require locks or queues.
- Memory management issues, particularly if chunks are large or not cleared after use.
- Limited scalability, since Python multiprocessing may not efficiently handle shared state across many processes compared to more optimized shared-memory structures.
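To make the idea above concrete, here is a minimal sketch of the shared-dictionary pattern (the chunk key, the `:size` sentinel key, and the simulated payload are illustrative assumptions, not the actual loader API): a downloader process publishes chunk bytes into a `multiprocessing.Manager().dict()`, and the reader polls for the key and the full chunk size before reading.

```python
import multiprocessing as mp
import time


def downloader(shared, chunk_key, payload):
    # Simulated download: in the real loader this would be the chunk
    # bytes fetched from remote storage.
    shared[chunk_key] = payload                  # publish the buffer
    shared[chunk_key + ":size"] = len(payload)   # signal "fully downloaded"


def reader(shared, chunk_key, expected_size):
    # Poll until the chunk exists and is complete before reading.
    while shared.get(chunk_key + ":size") != expected_size:
        time.sleep(0.01)
    return shared[chunk_key]


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.dict()
        payload = b"x" * 1024
        p = mp.Process(target=downloader, args=(shared, "chunk-0.bin", payload))
        p.start()
        data = reader(shared, "chunk-0.bin", len(payload))
        p.join()
        print(len(data))  # 1024
```

Writing the `:size` key *after* the payload key is what makes the simple polling check safe; it also shows where the overhead concern comes from, since every access to the `Manager` dict is a proxied IPC round-trip.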
I tried something similar in my Rust PR, where the downloaded chunk was deserialized and unflattened into a dict {index: item}, but RAM usage increased exponentially, close to 50-60 GB.

I haven't tried using a multiprocessing dict for sharing, but I think that would be much more complicated.
The DataLoader instantiates multiple dataset workers, and multiple DataLoaders are instantiated by Lightning. Sharing a dict across the DataLoaders won't be trivial.
Update
I prototyped an in-memory version of PyTreeLoader, similar to our earlier streaming ideas. It downloads chunks into RAM and streams them as they arrive—no file writes.
Tested on 12GB ImageNet:
- Baseline: ~6.5k samples/sec
- In-memory: ~5k samples/sec (even with sequential byte-range downloads)
Not faster yet, but promising given the early state. Prototype here: code
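The streaming behavior described above (reading from a chunk while its byte ranges are still arriving) can be sketched with a condition variable; this is an illustrative sketch, not the prototype's actual code, and the `InMemoryChunk` class name and sequential two-byte ranges are assumptions for the example:

```python
import threading


class InMemoryChunk:
    """Sketch: buffer a chunk in RAM and let readers consume byte ranges
    as soon as they have been downloaded (no file writes)."""

    def __init__(self, size):
        self._buf = bytearray(size)
        self._downloaded = 0  # bytes available so far (sequential download)
        self._cond = threading.Condition()

    def write_range(self, offset, data):
        # Called by the downloader thread for each sequential byte range.
        with self._cond:
            self._buf[offset:offset + len(data)] = data
            self._downloaded = offset + len(data)
            self._cond.notify_all()

    def read_range(self, offset, length):
        # Block until the requested range has arrived, then return it.
        with self._cond:
            self._cond.wait_for(lambda: self._downloaded >= offset + length)
            return bytes(self._buf[offset:offset + length])


# Simulated sequential byte-range download of a 10-byte chunk.
chunk = InMemoryChunk(10)
t = threading.Thread(
    target=lambda: [chunk.write_range(i, b"ab") for i in range(0, 10, 2)]
)
t.start()
first = chunk.read_range(0, 4)  # returns once the first two ranges land
t.join()
print(first)  # b'abab'
```

The key point is that the reader only blocks until *its* byte range is present, not until the whole chunk finishes, which is what lets samples start flowing before the download completes.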
Also, Thomas mentioned/suggested:

In theory, the S3 client can download into an `io.BytesIO`, and we can use a threading lock to block reading until the chunk is fully downloaded. We can use `s3.download_fileobj('amzn-s3-demo-bucket', 'OBJECT_NAME', f)` with `f` being an `io.BytesIO`. We can then extend `pre_load_chunk` to pass the bytes, so they get stored within the item loader dictionary. So basically, one thread passes the data to another.
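A minimal sketch of that thread handoff, with the actual S3 call stubbed out so it runs without network access (the bucket/object names and the simulated payload are placeholders; the real version would call `s3.download_fileobj` where indicated):

```python
import io
import threading


def fetch_chunk(buffer: io.BytesIO, done: threading.Event):
    # In the real loader this would be:
    #   s3.download_fileobj("amzn-s3-demo-bucket", "OBJECT_NAME", buffer)
    # Simulated here so the sketch runs without S3 access.
    buffer.write(b"chunk-bytes")
    done.set()  # unblock the reader once the download has finished


buffer = io.BytesIO()
done = threading.Event()

t = threading.Thread(target=fetch_chunk, args=(buffer, done))
t.start()

done.wait()  # reader blocks until the chunk is fully downloaded
data = buffer.getvalue()
t.join()
print(data)  # b'chunk-bytes'
```

An `Event` is the simplest primitive for this one-shot "fully downloaded" signal; a lock or condition would only be needed for the finer-grained partial-range reads discussed earlier.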
More to come with further investigation; just leaving this here for the record and future reference.