
[S3 storage_plugin] Seeing NoCredentialsError at random intervals when saving / restoring snapshots from S3.

Open hbikki opened this issue 1 year ago • 2 comments

🐛 Describe the bug

When loading a snapshot from S3 we are seeing a NoCredentialsError at random intervals. The issue is very similar to this one from aiobotocore: https://github.com/aio-libs/aiobotocore/issues/1006. It didn't happen when running <= 5 processes (an assumption based on running tests with varying process counts), but the error is consistent when running > 5 processes.

 Snapshot.take(path=str(save_dir), app_state=app_state)
  • Experimented with adding retries with exponential backoff when restoring the snapshot.
  • Tried using different versions of aiobotocore.
  • Verified from the logs that the _credential value is present.
  • Verified from the logs that credentials are available: /0 [6]:[2023-05-14 00:49:02,211][aiobotocore.credentials][INFO] - Found credentials from IAM Role: <ROLE Name>
  • The issue doesn't happen when the credentials are set via the ~/.aws/credentials file or environment variables.
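
For readers who want to reproduce the retry experiment from the first bullet, a minimal sketch follows. The report does not include the actual retry code, so the helper name `retry_with_backoff` and its parameters are hypothetical, and the delays are arbitrary defaults rather than tuned values.

```python
import random
import time


def retry_with_backoff(op, *, attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,)):
    """Call op() up to `attempts` times, sleeping longer after each failure.

    Hypothetical helper for illustration; not part of torchsnapshot.
    """
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff capped at max_delay, plus a little jitter
            # so that many processes do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))


# Possible usage (assumes `path` and `app_state` are defined by the caller):
#
#   from botocore.exceptions import NoCredentialsError
#   from torchsnapshot import Snapshot
#
#   retry_with_backoff(
#       lambda: Snapshot(path=path).restore(app_state=app_state),
#       retry_on=(NoCredentialsError,),
#   )
```

Restricting `retry_on` to `NoCredentialsError` keeps unrelated failures from being retried silently.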

NOTE: I don't see the failure after updating the S3 storage_plugin to use the boto3 S3 client or botocore.session. Testing time was about 2 hrs (~100 checkpoints).

Logs:

checkpointing_ddp/0 [3]:Traceback (most recent call last):
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/scheduler.py", line 369, in read_buffer
checkpointing_ddp/0 [3]:    await self.storage.read(read_io=read_io)
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-35' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155640>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,590][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [3]:  File "/home/User/torchsnapshot/torchsnapshot/storage_plugins/s3.py", line 60, in read
checkpointing_ddp/0 [3]:    response = await client.get_object(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 354, in _make_api_call
checkpointing_ddp/0 [3]:    http, parsed_response = await self._make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/client.py", line 379, in _make_request
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,610][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,589][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [3]:    return await self._endpoint.make_request(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 96, in _send_request
checkpointing_ddp/0 [3]:    request = await self.create_request(request_dict, operation_model)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/endpoint.py", line 84, in create_request
checkpointing_ddp/0 [0]:task: <Task pending name='Task-36' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155790>()]>>
checkpointing_ddp/0 [6]:[2023-05-14 00:17:58,634][aiobotocore.credentials][INFO] - Found credentials from IAM Role: ShopQADeveloperASGRole
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-37' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978155550>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-38' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007c10>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-39' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5978007ac0>()]>>
checkpointing_ddp/0 [0]:[2023-05-14 00:17:58,590][asyncio][ERROR] - Task was destroyed but it is pending!
checkpointing_ddp/0 [0]:task: <Task pending name='Task-40' coro=<_ReadPipeline.read_buffer() running at /home/User/torchsnapshot/torchsnapshot/scheduler.py:369> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f596ea95fa0>()]>>
checkpointing_ddp/0 [3]:    await self._event_emitter.emit(
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/hooks.py", line 66, in _emit
checkpointing_ddp/0 [3]:    response = await resolve_awaitable(handler(**kwargs))
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/_helpers.py", line 15, in resolve_awaitable
checkpointing_ddp/0 [3]:    return await obj
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 24, in handler
checkpointing_ddp/0 [3]:    return await self.sign(operation_name, request)
checkpointing_ddp/0 [3]:  File "/home/User/aiobotocore/aiobotocore/signers.py", line 82, in sign
checkpointing_ddp/0 [3]:    auth.add_auth(request)
checkpointing_ddp/0 [3]:  File "/opt/conda/envs/User/lib/python3.9/site-packages/botocore/auth.py", line 418, in add_auth
checkpointing_ddp/0 [3]:    raise NoCredentialsError()
checkpointing_ddp/0 [3]:botocore.exceptions.NoCredentialsError: Unable to locate credentials


Versions

pytorch = 2.0.0+cu117, torchx-nightly >= 2023.3.15, torchsnapshot = 0.1.0

hbikki • May 16 '23 05:05

Thanks for reporting @hbikki. You mentioned in the aio-libs issue that "when reading/writing to S3 with process count > 5 for versions 2.4.2". Curious if you had success with other versions?

yifuwang • May 24 '23 18:05

No, it isn't working even with different versions of aiobotocore.

hbikki • May 25 '23 17:05