Scheduling to control the download of datasets
🚀 Feature
The primary intent of this FR is to prevent the same dataset from being downloaded several times, once by each process. The idea is to provide a way to schedule processes using a context manager. The inspiration comes from the parallel sections of OpenMP and the pe logger.
Let's consider a process scheduler, based on a context manager, that is configured with an ordering policy. One such policy could be ordered scheduling, where processes run one after another by rank (useful for debugging):
with scheduler(order=Ordered()):
    print(f"{idist.get_rank()}...")
Result in the console:
0...
1...
2...
3...
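One way such an ordered policy could work is sketched below. This is only an illustration, not the implementation proposed in this FR: the name `ordered` and its parameters are hypothetical, and the ranks are simulated with threads and a `threading.Barrier` so the sketch runs without a distributed backend. In a real run, `idist.get_rank()`, `idist.get_world_size()` and `idist.barrier` would presumably be passed in instead.

```python
import threading
from contextlib import contextmanager


@contextmanager
def ordered(rank, world_size, barrier):
    """Hypothetical sketch: run the body of the `with` block rank by rank."""
    # Pass one barrier per lower rank, so we only start once they are done.
    for _ in range(rank):
        barrier()
    try:
        yield
    finally:
        # Keep joining the remaining barriers so higher ranks get their turn.
        # Every rank calls `barrier` exactly `world_size` times in total.
        for _ in range(world_size - rank):
            barrier()


def run_ordered_demo(world_size=4):
    """Simulate `world_size` ranks with threads and record the execution order."""
    sync = threading.Barrier(world_size)
    order = []

    def worker(rank):
        with ordered(rank, world_size, sync.wait):
            order.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    # Start in reverse order to show that the result does not depend on start order.
    for t in reversed(threads):
        t.start()
    for t in threads:
        t.join()
    return order
```

The barrier count is the same on every rank, which is what keeps the collective calls matched; the cost is `world_size` synchronizations per block, so this only makes sense for debugging.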
Another policy could be to run the block on the master process (rank 0) first while the others wait. Once the master has finished, the others can run.
with scheduler(order=MasterFirst()) as download:
    cifar = torchvision.datasets.CIFAR10(root="/tmp/cifar", download=download)
Here, only one process downloads while the others wait. Afterwards, every process builds the dataset from the downloaded files.
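A master-first policy can be sketched with a single barrier. Again, this is an assumption about how it might be done rather than the code behind this FR: `master_first` is a hypothetical name, and threads stand in for ranks so the example is runnable on its own. With Ignite, `idist.get_rank()` and `idist.barrier` would presumably take the place of the `rank` and `barrier` arguments.

```python
import threading
from contextlib import contextmanager


@contextmanager
def master_first(rank, barrier):
    """Hypothetical sketch: rank 0 runs the body first and receives True."""
    is_master = rank == 0
    if not is_master:
        # Non-master ranks block here until rank 0 has left its `with` block.
        barrier()
    try:
        yield is_master
    finally:
        if is_master:
            # Rank 0 joins the barrier after its body, releasing the others.
            barrier()


def run_master_first_demo(world_size=4):
    """Simulate ranks with threads; record (rank, download) in execution order."""
    sync = threading.Barrier(world_size)
    events = []

    def worker(rank):
        with master_first(rank, sync.wait) as download:
            events.append((rank, download))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in reversed(threads):
        t.start()
    for t in threads:
        t.join()
    return events
```

The yielded flag is what gets forwarded as `download=download`, so only rank 0 fetches the files and the remaining ranks construct the dataset from what is already on disk.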
A last policy would check whether a path is shared by multiple processes. Among the processes sharing a path, only one downloads the dataset.
path = f"/tmp/shared_{idist.get_rank() % 2}"
with scheduler(order=SharedPath(path=path)) as download:
    cifar = torchvision.datasets.CIFAR10(root=path, download=download)
Here, two processes should download into separate paths while the others wait. On a cluster, for instance, if a node-local path such as /tmp is used, only one process per node should download the dataset.
I have already implemented this FR. Let me know if it is relevant and I will push a PR.
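A shared-path policy needs some form of lock that lives on the path itself, since ranks on different nodes may or may not see the same files. The sketch below is one possible approach using only the standard library: it relies on `os.mkdir` being atomic to act as a lock, plus a marker file to remember that the download is done. The function name `shared_path` and the `.download.lock` / `.download.done` names are hypothetical, and a process that crashes while holding the lock would leave a stale lock directory behind.

```python
import os
import tempfile
import threading
import time
from contextlib import contextmanager


@contextmanager
def shared_path(path, poll=0.01):
    """Hypothetical sketch: yield True to exactly one process per `path`."""
    os.makedirs(path, exist_ok=True)
    lock_dir = os.path.join(path, ".download.lock")   # hypothetical lock name
    done_file = os.path.join(path, ".download.done")  # hypothetical marker name
    # os.mkdir either creates the directory or fails, so one caller wins.
    while True:
        try:
            os.mkdir(lock_dir)
            break
        except FileExistsError:
            time.sleep(poll)
    try:
        download = not os.path.exists(done_file)
        yield download
        if download:
            # Only reached when the body finished without raising.
            with open(done_file, "w"):
                pass
    finally:
        os.rmdir(lock_dir)


def run_shared_path_demo(world_size=4):
    """Simulate ranks with threads; ranks with the same parity share a path."""
    root = tempfile.mkdtemp()
    results = []

    def worker(rank):
        path = os.path.join(root, f"shared_{rank % 2}")
        with shared_path(path) as download:
            results.append((rank % 2, download))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Note that in this sketch the lock is held for the whole body, so processes sharing a path also build the dataset one at a time. Releasing the lock before yielding to the non-downloading processes would avoid that.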
@sdesrozis interesting ideas. I think a context manager with at least zero rank + barrier could be helpful.
The name scheduler is a bit unclear. If we exposed it as idist.rank_scheduler, it might make more sense...
By the way, I also saw this approach for downloading on a single rank:
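import os
import tempfile

from filelock import FileLock

# Processes on the same machine take the lock in turn; only the first one downloads.
with FileLock(os.path.join(tempfile.gettempdir(), "download_data.lock")):
    if not os.path.exists(dataset_path):
        download_data(dataset_path)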
- https://docs.ray.io/en/latest/raysgd/raysgd_pytorch.html#debugging-tips
Using FileLock was my idea for implementing SharedPath, but I totally agree that the naming is not so good. That's why I prefer to discuss it here with interested people before submitting a PR.
Since it seems relevant, I will submit a PR and then we can discuss it in detail. What do you think?