torchsnapshot
[S3 storage_plugin] NoCredentialsError at random intervals when saving / restoring a snapshot from S3
🐛 Describe the bug
When loading a snapshot from S3, we see a NoCredentialsError at random intervals. The issue is very similar to this one from aiobotocore: https://github.com/aio-libs/aiobotocore/issues/1006. It did not happen when running <=5 processes (an assumption based on tests with varying process counts), but the error occurs consistently when running >5 processes.
Snapshot.take(path=str(save_dir), app_state=app_state)
- Experimented with adding retries with exponential backoff when restoring the snapshot.
- Tried different versions of aiobotocore.
- Verified from the logs that the _credential value is present.
- Verified from the logs that credentials are available: /0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role: <ROLE Name>
- The issue doesn't happen when the credentials are set via the ~/.aws/credentials file or environment variables.
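For anyone who wants to reproduce the retry experiment from the list above: the exact retry code is not included in this report, so the following is only a minimal sketch of retrying with exponential backoff and jitter. The helper name `retry_with_backoff` and all of its parameters are hypothetical, not part of torchsnapshot or aiobotocore.

```python
import random
import time


def retry_with_backoff(fn, retry_on, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call fn() until it succeeds, waiting exponentially longer after each failure.

    Hypothetical helper for illustration; `retry_on` is a tuple of exception types.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Nominal delay doubles per attempt and is capped at max_delay;
            # jitter scales it to 50-100% so processes do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * random.uniform(0.5, 1.0))


# Possible usage around a restore call (assumes `snapshot` and `app_state` exist):
# from botocore.exceptions import NoCredentialsError
# retry_with_backoff(lambda: snapshot.restore(app_state=app_state),
#                    retry_on=(NoCredentialsError,))
```

Note that a retry like this only masks the intermittent failure; it does not explain why the credentials are missing at signing time.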
NOTE: I don't see the failure after updating the S3 storage_plugin to use a boto3 S3 client or botocore.session. Testing time was about 2 hours (~100 checkpoints).
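The modified plugin is not shown here, so as a rough illustration of the boto3-based approach mentioned in the note, the sketch below runs a synchronous `get_object` call in the default thread pool so it can still be awaited from async code. The function name `read_object_sync_client` is hypothetical, and this is an assumption about how such a workaround could look, not the actual change that was tested.

```python
import asyncio


async def read_object_sync_client(client, bucket, key):
    """Fetch an S3 object with a synchronous client without blocking the event loop.

    `client` is expected to behave like boto3.client("s3"); hypothetical helper.
    """
    loop = asyncio.get_running_loop()

    def _fetch():
        # get_object returns a dict whose "Body" is a readable stream.
        response = client.get_object(Bucket=bucket, Key=key)
        return response["Body"].read()

    # Run the blocking call in the default executor (a thread pool).
    return await loop.run_in_executor(None, _fetch)


# Possible usage (assumes boto3 is installed and credentials resolve):
# import boto3
# data = asyncio.run(read_object_sync_client(boto3.client("s3"), "my-bucket", "my-key"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub and avoids creating a new client per read.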
Logs:
checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]: File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]: await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]: File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]: response = await client.get_object(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]: http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]: return await self._endpoint.make_request(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]: request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]: await self._event_emitter.emit(
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]: response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]: return await obj
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]: return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]: File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]: auth.add_auth(request)
checkpointing_ddp/0 [3]: File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]: raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials
Versions
pytorch = 2.0.0+cu117 torchx-nightly>=2023.3.15 torchsnapshot=0.1.0
Thanks for reporting @hbikki. You mentioned in the aio-libs issue that "when reading/writing to S3 with process count > 5 for versions 2.4.2". Curious if you had success with other versions?
No, it isn't working even with different versions of aiobotocore.