Is `StreamingDataset` compatible with Ray distributed training?

Open genesis-jamin opened this issue 2 years ago • 1 comments

I saw the following comment in #332. Do you know if StreamingDataset is compatible with multi-node distributed training using Ray?

Hi @shivshandilya, Streaming dataset is not fully compatible with PyTorch Lightning because of limitations in how PyTorch Lightning spins up processes. With PyTorch Lightning (PTL) using DDP or another distributed strategy, when a user runs a script with python main.py --n_devices 8, the main process runs the main.py script, and when you call trainer.fit(), PTL creates N-1 (7 in this case) more processes, for a total of N including the original main process. This means that the dataset created on the main process is constructed before any distributed environment variables are set, and before the other processes are even launched. Streaming dataset relies on certain environment variables being set before any process instantiates the StreamingDataset; otherwise it uses the default values for the non-distributed use case. Those variables are WORLD_SIZE, LOCAL_WORLD_SIZE, and RANK. Also, with strategies such as ddp_fork and ddp_spawn, PTL doesn't expose these environment variables, which makes it impossible for Streaming to know how to split the dataset across ranks and how to communicate locally via multiprocessing shared memory.

I would recommend using torchrun, the composer launcher, or another launcher that spins up processes the way torchrun/composer does and sets the above environment variables.

Originally posted by @karan6181 in https://github.com/mosaicml/streaming/issues/332#issuecomment-1642311367
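
To make the requirement in the quoted comment concrete, here is a minimal sketch of checking for WORLD_SIZE, LOCAL_WORLD_SIZE, and RANK in a worker process before the dataset is constructed. The helper names `missing_dist_env` and `set_dist_env` are hypothetical and not part of Streaming, and whether Ray workers set these variables on their own is not confirmed anywhere in this thread.

```python
import os

# Variables the quoted comment says must be set before StreamingDataset is created.
REQUIRED_VARS = ("WORLD_SIZE", "LOCAL_WORLD_SIZE", "RANK")


def missing_dist_env(env=None):
    """Return the required variables that are absent from env (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


def set_dist_env(rank, world_size, local_world_size, env=None):
    """Write the three variables as strings into env (hypothetical helper)."""
    env = os.environ if env is None else env
    env["RANK"] = str(rank)
    env["WORLD_SIZE"] = str(world_size)
    env["LOCAL_WORLD_SIZE"] = str(local_world_size)
    return env


# Example on a plain dict so the real process environment is untouched:
# rank 3 of 16 processes, with 8 processes per node.
example_env = {}
print(missing_dist_env(example_env))
set_dist_env(rank=3, world_size=16, local_world_size=8, env=example_env)
print(missing_dist_env(example_env))
```

In a real worker, the check would run against `os.environ` at the top of the per-process training function, and the dataset would only be created once `missing_dist_env()` returns an empty list.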

genesis-jamin avatar Nov 07 '23 20:11 genesis-jamin

Hi @genesis-jamin, we have not tried out Ray + Streaming Dataset extensively, but I would like to understand whether you see any issues using it with Ray.

karan6181 avatar Nov 08 '23 20:11 karan6181

Closing out this issue as it has been inactive for a while.

snarayan21 avatar May 29 '24 19:05 snarayan21