Is `StreamingDataset` compatible with Ray distributed training?
I saw the following comment in #332. Do you know if StreamingDataset is compatible with multi-node distributed training using Ray?
Hi @shivshandilya, StreamingDataset is not fully compatible with PyTorch Lightning; there are limitations in how PyTorch Lightning spins up its processes. With PyTorch Lightning (PTL) and DDP or another distributed strategy, when a user runs a script with `python main.py --n_devices 8`, the main process runs the `main.py` script, and when you call `trainer.fit()`, PTL creates N-1 (7 in this case) more processes, for a total of N including the original main process. This means the dataset created on the main process is constructed before any distributed environment variables are set, and before the other processes are even launched. StreamingDataset relies on certain environment variables being set before any process instantiates it; otherwise it falls back to the defaults for the non-distributed case. Those variables are `WORLD_SIZE`, `LOCAL_WORLD_SIZE`, and `RANK`.

Also, when using a strategy such as `ddp_fork` or `ddp_spawn`, PTL does not expose those environment variables (`WORLD_SIZE`, `LOCAL_WORLD_SIZE`, and `RANK`), which makes it impossible for Streaming to know how to split the dataset across ranks and how to communicate locally via multiprocessing shared memory.

I would recommend using torchrun, the Composer launcher, or another launcher that spins up processes the same way and sets the environment variables above.
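For reference, a minimal sketch of the recommended pattern (the remote/local paths below are placeholders, not part of the original comment): launch the script with `torchrun --nproc_per_node 8 main.py` so every process has the environment variables set before the dataset is constructed.

```python
import os

from torch.utils.data import DataLoader
from streaming import StreamingDataset


def main():
    # Under torchrun, RANK / WORLD_SIZE / LOCAL_WORLD_SIZE are already set
    # in every process; StreamingDataset reads them to partition samples
    # across ranks. Without them it falls back to single-process defaults.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    print(f"rank {rank} of {world_size}")

    dataset = StreamingDataset(
        remote="s3://my-bucket/my-dataset",  # placeholder remote path
        local="/tmp/my-dataset",             # placeholder local cache dir
        shuffle=True,
        batch_size=32,
    )
    loader = DataLoader(dataset, batch_size=32)
    for batch in loader:
        pass  # training step goes here


if __name__ == "__main__":
    main()
```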
Originally posted by @karan6181 in https://github.com/mosaicml/streaming/issues/332#issuecomment-1642311367
Hi @genesis-jamin, we have not tried Ray + StreamingDataset extensively, but I would like to understand whether you see any issues using it with Ray.
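For anyone experimenting along these lines, here is one untested sketch using Ray Train's `TorchTrainer`. Since Streaming has not been validated with Ray here, mirroring each worker's rank information into the environment variables Streaming expects is an assumption, not a verified recipe, and the paths and worker count are placeholders.

```python
import os

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from streaming import StreamingDataset


def train_loop_per_worker(config):
    # Assumption: expose each Ray Train worker's rank/world-size info via the
    # environment variables Streaming reads, in case the Torch backend has
    # not already set them before the dataset is constructed.
    ctx = ray.train.get_context()
    os.environ.setdefault("RANK", str(ctx.get_world_rank()))
    os.environ.setdefault("WORLD_SIZE", str(ctx.get_world_size()))
    os.environ.setdefault("LOCAL_WORLD_SIZE", str(ctx.get_local_world_size()))

    dataset = StreamingDataset(
        remote="s3://my-bucket/my-dataset",  # placeholder remote path
        local="/tmp/my-dataset",             # placeholder local cache dir
        batch_size=32,
    )
    for sample in dataset:
        pass  # training step goes here


if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    )
    trainer.fit()
```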
Closing out this issue as it has been inactive for a while.