
[BUG] corrupted dataset due to simultaneous downloading by all ranks.

Open LamForest opened this issue 1 year ago • 0 comments

Add Link

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html

Describe the bug

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 9912422/9912422 [00:03<00:00, 3078874.05it/s]

  5%|█████▎                                                                                                    | 491520/9912422 [00:01<00:22, 417952.41it/s]Traceback (most recent call last):
  File "fsdp_mnist.py", line 173, in <module>
    mp.spawn(fsdp_main,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/ssd1/gaotianlin/baidu/hac-aiacc/Megatron/old_scripts/fsdp/fsdp_mnist.py", line 94, in fsdp_main
    dataset1 = datasets.MNIST('./data', train=True, download=True,
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
    self.download()
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/mnist.py", line 187, in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 434, in download_and_extract_archive
    download_url(url, download_root, filename, md5)
  File "/root/miniconda3/envs/old_mega/lib/python3.8/site-packages/torchvision/datasets/utils.py", line 155, in download_url
    raise RuntimeError("File not found or corrupted.")
RuntimeError: File not found or corrupted.

/root/miniconda3/envs/old_mega/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
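
A possible workaround, sketched under assumptions rather than taken from the tutorial: let rank 0 download the dataset first while the other ranks wait at a barrier, then have them open the files with `download=False`. The helper name `build_dataset_rank0_first` and its `make_dataset` / `barrier` parameters are hypothetical. In `fsdp_main` they would presumably be wired to `datasets.MNIST(...)` and `torch.distributed.barrier`, assuming the process group has already been initialized before the dataset is created.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def build_dataset_rank0_first(
    rank: int,
    make_dataset: Callable[[bool], T],
    barrier: Callable[[], None],
) -> T:
    """Let rank 0 download first; every other rank waits, then only reads.

    make_dataset(download) is a hypothetical factory, for example
        lambda download: datasets.MNIST('./data', train=True,
                                        download=download, transform=transform)
    barrier is assumed to block until all ranks have reached it, for example
    torch.distributed.barrier once the process group is initialized.
    """
    dataset = None
    if rank == 0:
        # Only one process writes into the data directory.
        dataset = make_dataset(True)
    # All ranks meet here; rank 0 arrives only after its download has finished.
    barrier()
    if rank != 0:
        # The files should be on disk now, so read them without downloading.
        dataset = make_dataset(False)
    return dataset
```

Inside `fsdp_main` this might be used as `dataset1 = build_dataset_rank0_first(rank, lambda d: datasets.MNIST('./data', train=True, download=d, transform=transform), dist.barrier)`. This assumes all ranks share one filesystem; on a multi-node run without shared storage, the check would likely need to use the local rank on each node instead of the global rank.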

Describe your environment

...

LamForest, Sep 29 '24 18:09