
Investigate keeping the content of the downloaded chunks in RAM instead of writing it to file.

Open tchaton opened this issue 1 year ago • 5 comments

🚀 Feature


tchaton avatar Aug 01 '24 13:08 tchaton

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]

Let's keep this open, as we've been experimenting around this issue. We'll continue exploring and will add our findings from the last few experiments here.

bhimrazy avatar Apr 17 '25 07:04 bhimrazy

Idea: Another approach worth exploring: we create a multiprocessing dictionary shared between the workers, or a plain dict kept within each worker process. The downloader workers/threads would add data to a buffer under the corresponding chunk key in the shared dictionary, and the reader could then check for the existence of that key (along with certain byte ranges, or the full size of the chunk) before starting to read. A minimal sketch of this follows the list of downsides below.

Some potential downsides of this approach might include:

  • Performance bottlenecks due to the overhead of multiprocessing.Manager().dict(), especially under heavy concurrent access.
  • Synchronization complexity, as ensuring thread/process safety for concurrent reads and writes to the buffer may require locks or queues.
  • Memory management issues, particularly if chunks are large or not cleared after use.
  • Limited scalability, since Python multiprocessing may not efficiently handle shared state across many processes compared to more optimized shared-memory structures.
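
Here's a minimal, hedged sketch of that shared-dict idea, using `multiprocessing.Manager().dict()` as the chunk buffer. The `downloader`/`read_chunk` functions and the polling loop are illustrative assumptions, not litdata code:

```python
import multiprocessing as mp
import time


def downloader(shared, chunk_key: str, data: bytes) -> None:
    # Downloader side: publish the fully downloaded chunk under its key.
    shared[chunk_key] = data


def read_chunk(shared, chunk_key: str, timeout: float = 30.0) -> bytes:
    # Reader side: poll until the chunk key appears, then consume it.
    deadline = time.monotonic() + timeout
    while chunk_key not in shared:
        if time.monotonic() > deadline:
            raise TimeoutError(f"chunk {chunk_key!r} never arrived")
        time.sleep(0.01)
    return shared[chunk_key]


if __name__ == "__main__":
    manager = mp.Manager()
    shared = manager.dict()  # proxied dict, shared across processes

    p = mp.Process(target=downloader, args=(shared, "chunk-0.bin", b"\x00" * 1024))
    p.start()
    print(len(read_chunk(shared, "chunk-0.bin")))  # -> 1024
    p.join()

    # Clearing consumed chunks matters: large values left in the proxy
    # dict are exactly the memory-management downside listed above.
    del shared["chunk-0.bin"]
```

Every access to a `Manager().dict()` goes through a proxy and the manager process, which is where the performance-bottleneck concern above comes from.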

bhimrazy avatar Apr 17 '25 07:04 bhimrazy

I tried something similar in my Rust PR, where the downloaded chunk was deserialized, unflattened, and kept as a dict {index: item}, but RAM usage grew sharply, to around 50-60 GB.

I haven't tried using a multiprocessing dict for sharing, but I think it would be much more complicated.

The DataLoader instantiates multiple dataset workers, and multiple DataLoaders are instantiated by Lightning.

Sharing a dict across those DataLoaders won't be trivial.

deependujha avatar May 05 '25 11:05 deependujha

Update: I prototyped an in-memory version of PyTreeLoader, similar to our earlier streaming ideas. It downloads chunks into RAM and streams them as they arrive, with no file writes.

Tested on 12GB ImageNet:

  • Baseline: ~6.5k samples/sec
  • In-memory: ~5k samples/sec (even with sequential byte-range downloads)

Not faster yet, but promising given the early state. Prototype here: code
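
For the record, the sequential byte-range download into RAM looks roughly like the sketch below (boto3-based; the bucket, key, and 8 MiB range size are placeholder assumptions, not the prototype's actual values):

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "imagenet/chunk-0.bin"  # placeholders
RANGE_SIZE = 8 * 1024 * 1024  # 8 MiB per byte-range request

# Find out how big the chunk is without downloading it.
total = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

buffer = bytearray()
offset = 0
while offset < total:
    end = min(offset + RANGE_SIZE, total) - 1  # Range header is inclusive
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")
    buffer += resp["Body"].read()
    # A reader could already deserialize samples whose byte ranges fall
    # entirely inside `buffer` here, before the full chunk arrives.
    offset = end + 1

assert len(buffer) == total
```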

Thomas also suggested:

In theory, the S3 client can download to an io.BytesIO, and we can use a threading lock to block reading until the chunk is fully downloaded.

We can use s3.download_fileobj('amzn-s3-demo-bucket', 'OBJECT_NAME', f) with f being an io.BytesIO. We can then extend pre_load_chunk to pass the bytes along so they get stored within the item loader dictionary. So basically, one thread passes the data to another.
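
A minimal sketch of that suggestion, assuming a plain boto3 client and using a threading.Event rather than a lock to block readers (the pre_load_chunk wiring into the item loader is omitted):

```python
import io
import threading

import boto3

s3 = boto3.client("s3")


def fetch_chunk(bucket: str, key: str) -> tuple[io.BytesIO, threading.Event]:
    buf = io.BytesIO()
    done = threading.Event()

    def _download() -> None:
        # Stream the whole object into the in-memory buffer.
        s3.download_fileobj(bucket, key, buf)
        buf.seek(0)
        done.set()  # unblock any waiting reader

    threading.Thread(target=_download, daemon=True).start()
    return buf, done


# Reader side: one thread hands the bytes to another.
buf, done = fetch_chunk("amzn-s3-demo-bucket", "OBJECT_NAME")
done.wait()  # block until the chunk is fully in RAM
chunk_bytes = buf.getvalue()
```

An Event fits this one-way handoff: the download thread sets it once, and any number of readers can wait on it without juggling acquire/release pairs.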

More to come with further investigation; just leaving this here for history and future reference.

bhimrazy avatar Jun 04 '25 06:06 bhimrazy