FSSpecFileOpenerIterDataPipe raises a `NoCredentialsError` on large dataloader `num_workers`
🐛 Describe the bug
There are two issues (both are reproducible using the script below):
- `FSSpecFileOpenerIterDataPipe` gets stuck if one tries to iteratively create `DataLoader(num_workers=0, ...)` then `DataLoader(num_workers=greater_than_zero)`. Practically speaking this isn't much of an issue, since typically a trainer will create the dataloader once, but for benchmarking this means that we can't iterate benchmark runs that change dataloader `num_workers` from the same parent process.
- `NoCredentialsError` when using `FSSpecFileOpenerIterDataPipe` with large (>64) dataloader `num_workers`.
Repro Script
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

if __name__ == "__main__":
    print(f"=== BEGIN REPRO TEST ===")
    data_s3url = "s3://<REPLACE_WITH_YOUR_S3_URL>"
    # workers = [0, 1, 2]  # <-- use this to repro stuckness. You'll observe that the loop below will get stuck when i=1
    workers = [1, 2, 4, 8, 16, 32, 48, 64, 128]
    for i in workers:
        dataset = (
            IterableWrapper([data_s3url])
            .list_files_by_fsspec()
            .open_files_by_fsspec()
            .readlines(return_path=False)
        )
        try:
            for batch in DataLoader(
                dataset,
                batch_size=max(workers) * 2,
                num_workers=i,
            ):
                break
            print(f"Succeeded running with num_workers={i}")
        except Exception as e:
            print(f"Error running with num_workers={i}. Exception: {e}")
    print(f"=== END REPRO TEST ===")
Exception
Traceback (most recent call last):
File "/home/ubuntu/workspace/mfive/mfive/examples/data/repro.py", line 28, in <module>
for batch in DataLoader(
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/_utils.py", line 460, in reraise
raise RuntimeError(msg) from None
RuntimeError: Caught NoCredentialsError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 344, in __iter__
yield from self._datapipe
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/util/plain_text_reader.py", line 121, in __iter__
for path, file in self.source_datapipe:
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/load/fsspec.py", line 137, in __iter__
for file_uri in self.source_datapipe:
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py", line 514, in wrap_generator
response = gen.send(None)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/torchdata/datapipes/iter/load/fsspec.py", line 85, in __iter__
for file_name in fs.ls(path):
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 91, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 71, in sync
raise return_result
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/s3fs/core.py", line 810, in _ls
files = await self._lsdir(path, refresh)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/s3fs/core.py", line 593, in _lsdir
async for i in it:
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/paginate.py", line 32, in __anext__
response = await self._make_request(current_kwargs)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/client.py", line 173, in _make_api_call
http, parsed_response = await self._make_request(
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/client.py", line 193, in _make_request
return await self._endpoint.make_request(operation_model, request_dict)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
request = await self.create_request(request_dict, operation_model)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/endpoint.py", line 70, in create_request
await self._event_emitter.emit(event_name, request=request,
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/hooks.py", line 27, in _emit
response = await handler(**kwargs)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/signers.py", line 16, in handler
return await self.sign(operation_name, request)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/aiobotocore/signers.py", line 63, in sign
auth.add_auth(request)
File "/home/ubuntu/.pyenv/versions/venv39/lib/python3.9/site-packages/botocore/auth.py", line 378, in add_auth
raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
This exception is thrown by __iter__ of FSSpecFileListerIterDataPipe(kwargs={}, masks='')
Versions
- torchdata-0.4.1
- torch-1.12.1
- fsspec-2022.1.0
- s3fs-2022.1.0
Thanks for opening the issue.
I am able to reproduce the first issue: a subsequent `DataLoader` gets stuck when the prior `DataLoader` had `num_workers=0`. I am investigating it now.
However, I am not able to reproduce the second issue. I am not sure why a larger number of processes would interfere with the credentials. I suspect this is tied to boto3 not correctly handling races on credential reads.
For Issue 1, I have currently narrowed it down to the `fork` start method. If I set `multiprocessing_context="spawn"` or `"forkserver"` for the subsequent `DataLoader`, the hang does not occur.
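A minimal sketch of that workaround, assuming the hang only shows up with the default `fork` start method. The helper name `loader_kwargs` is hypothetical; `num_workers` and `multiprocessing_context` are existing `DataLoader` arguments. The context is only added when `num_workers > 0`, since the argument only applies to multi-process loading.

```python
import multiprocessing


def loader_kwargs(num_workers, start_method="spawn"):
    """Build DataLoader keyword arguments that avoid the `fork` start method.

    `loader_kwargs` is a hypothetical helper, not part of torch or torchdata.
    """
    if start_method not in multiprocessing.get_all_start_methods():
        raise ValueError(
            f"start method {start_method!r} is not available on this platform"
        )
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        # Only relevant when worker processes are actually created.
        kwargs["multiprocessing_context"] = start_method
    return kwargs
```

In the repro loop this could be used as `DataLoader(dataset, batch_size=max(workers) * 2, **loader_kwargs(i))`. With `spawn` or `forkserver`, the dataset is pickled and sent to each worker, so the datapipe graph needs to be picklable.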
Hi, regarding issue 2, I am using `FSSpecFileOpenerIterDataPipe` with 22 workers (and 4 GPUs) to load data from S3, and I am also getting `NoCredentialsError`. Any progress on this issue? Thanks!
I ran into the same issue and found that setting `AWS_METADATA_SERVICE_NUM_ATTEMPTS` helps mitigate it, as mentioned here.
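A sketch of that mitigation, assuming credentials come from the EC2 instance metadata service (for example an instance role) and that some lookups fail when many workers query it at once. botocore reads `AWS_METADATA_SERVICE_NUM_ATTEMPTS` and `AWS_METADATA_SERVICE_TIMEOUT` from the environment; the helper name and the values below are illustrative, not tuned recommendations. Set them in the parent process before the `DataLoader` is created so that the workers inherit them.

```python
import os


def configure_metadata_retries(num_attempts=5, timeout_seconds=5):
    """Set botocore's instance-metadata retry settings via the environment.

    Hypothetical helper; the default values are illustrative only.
    """
    settings = {
        "AWS_METADATA_SERVICE_NUM_ATTEMPTS": str(num_attempts),
        "AWS_METADATA_SERVICE_TIMEOUT": str(timeout_seconds),
    }
    # Child processes inherit os.environ from the parent process.
    os.environ.update(settings)
    return settings


# Call once in the parent process, before any DataLoader workers start.
configure_metadata_retries()
```

Exporting the same variables in the shell before launching the script should have the same effect.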