Is `StreamingDataset` compatible with Ray distributed training?
I saw the following comment in #332. Do you know if StreamingDataset is compatible with multi-node distributed training using Ray?
Hi @shivshandilya, StreamingDataset is not fully compatible with PyTorch Lightning; there are limitations in how PyTorch Lightning spins up its processes. With PyTorch Lightning (PTL) and DDP or another distributed strategy, when a user runs a script with `python main.py --n_devices 8`, the main process runs the `main.py` script, and when you call `trainer.fit()`, PTL creates N-1 (7 in this case) more processes, for a total of N including the original main process. This means the dataset created on the main process is constructed before any distributed environment variables are set, and before the other processes are even launched. StreamingDataset relies on certain environment variables being set before any process instantiates it; otherwise it falls back to the defaults for the non-distributed case. Those variables are `WORLD_SIZE`, `LOCAL_WORLD_SIZE`, and `RANK`.

Also, when using a strategy such as `ddp_fork` or `ddp_spawn`, PTL does not expose those environment variables (`WORLD_SIZE`, `LOCAL_WORLD_SIZE`, and `RANK`), which makes it impossible for Streaming to know how to split the dataset across ranks and how to communicate locally via multiprocessing shared memory.

I would recommend using torchrun, the Composer launcher, or another launcher that spins up processes the same way and sets the environment variables above.
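For reference, a minimal sketch of the recommended pattern (the remote/local paths below are placeholders, not part of the original comment): launch the script with `torchrun --nproc_per_node 8 main.py` so every process has the environment variables set before the dataset is constructed.

```python
import os

from torch.utils.data import DataLoader
from streaming import StreamingDataset


def main():
    # Under torchrun, RANK / WORLD_SIZE / LOCAL_WORLD_SIZE are already set
    # in every process; StreamingDataset reads them to partition samples
    # across ranks. Without them it falls back to single-process defaults.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    print(f"rank {rank} of {world_size}")

    dataset = StreamingDataset(
        remote="s3://my-bucket/my-dataset",  # placeholder remote path
        local="/tmp/my-dataset",             # placeholder local cache dir
        shuffle=True,
        batch_size=32,
    )
    loader = DataLoader(dataset, batch_size=32)
    for batch in loader:
        pass  # training step goes here


if __name__ == "__main__":
    main()
```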
Originally posted by @karan6181 in https://github.com/mosaicml/streaming/issues/332#issuecomment-1642311367
Hi @genesis-jamin, we have not tried Ray + StreamingDataset extensively, but I would like to understand whether you see any issues using it with Ray.
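For anyone experimenting along these lines, here is one untested sketch using Ray Train's `TorchTrainer`. Since Streaming has not been validated with Ray here, mirroring each worker's rank information into the environment variables Streaming expects is an assumption, not a verified recipe, and the paths and worker count are placeholders.

```python
import os

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from streaming import StreamingDataset


def train_loop_per_worker(config):
    # Assumption: expose each Ray Train worker's rank/world-size info via the
    # environment variables Streaming reads, in case the Torch backend has
    # not already set them before the dataset is constructed.
    ctx = ray.train.get_context()
    os.environ.setdefault("RANK", str(ctx.get_world_rank()))
    os.environ.setdefault("WORLD_SIZE", str(ctx.get_world_size()))
    os.environ.setdefault("LOCAL_WORLD_SIZE", str(ctx.get_local_world_size()))

    dataset = StreamingDataset(
        remote="s3://my-bucket/my-dataset",  # placeholder remote path
        local="/tmp/my-dataset",             # placeholder local cache dir
        batch_size=32,
    )
    for sample in dataset:
        pass  # training step goes here


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    )
    trainer.fit()
```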
Closing out this issue as it has been inactive for a while.