
Race condition in botocore's use of JSONFileCache for credential loading

Open mikelambert opened this issue 1 year ago • 10 comments

Describe the bug

I am getting a rare crash when using botocore to load files.

It's a JSONDecodeError: Expecting value: line 1 column 1 (char 0) stemming from JSONFileCache.__getitem__, which seems to imply the JSON file is empty (no value found at char 0). Someone pointed out it might be a race condition in JSONFileCache.__setitem__, which appears to do:

            f.truncate()
            f.write(file_content)

We have multiple concurrent processes starting on a box, each using botocore, so maybe this is just a race condition where one of them happens to read the file post-truncate, pre-write? I'm not sure whether a flock, a write-then-rename, or something else is the proper solution here.

Expected Behavior

It shouldn't crash

Current Behavior

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
  File "botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
  File "__init__.py", line 293, in load
    return loads(fp.read(),
  File "__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None


  File "botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "botocore/client.py", line 989, in _make_api_call
    http, parsed_response = self._make_request(
  File "botocore/client.py", line 1015, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "botocore/signers.py", line 186, in sign
    auth = self.get_auth_instance(**kwargs)
  File "botocore/signers.py", line 301, in get_auth_instance
    frozen_credentials = credentials.get_frozen_credentials()
  File "botocore/credentials.py", line 634, in get_frozen_credentials
    self._refresh()
  File "botocore/credentials.py", line 522, in _refresh
    self._protected_refresh(is_mandatory=is_mandatory_refresh)
  File "botocore/credentials.py", line 538, in _protected_refresh
    metadata = self._refresh_using()
  File "botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
  File "botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
  File "botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
  File "botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)

Reproduction Steps

Hard to reproduce, and I haven't tried it myself. I assume thousands of processes recreating the botocore cached credentials file would do it.

Possible Solution

Perhaps a flock, or a write-to-temp-file-then-rename-to-destination approach, instead of truncate-then-write?
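
As a rough sketch of the flock variant (POSIX-only, the function name is just illustrative, and readers would also need to take a shared lock for it to actually help):

import fcntl
import json
import os


def set_json_cache_entry(path, value):
    # Hypothetical writer that serializes writers with an advisory lock.
    # Readers would need to take fcntl.LOCK_SH on the same file for the
    # scheme to actually close the empty-file window.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    with os.fdopen(fd, "w") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # block until we own the file
        try:
            f.truncate()
            f.write(json.dumps(value))
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)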

Additional Information/Context

No response

SDK version used

1.34.42

Environment details (OS name and version, etc.)

AWS instance running x86_64 GNU/Linux

mikelambert avatar Jun 27 '24 01:06 mikelambert

The following does reproduce the issue instantly:

from threading import Thread
from botocore.utils import JSONFileCache

cache = JSONFileCache()


def f():
    # Hammer the same key: a read that lands between another thread's
    # truncate() and write() sees an empty file and fails to parse.
    for i in range(100000):
        cache["key"] = 10
        cache["key"]


threads = []
for i in range(2):
    thread = Thread(target=f, name=f"Thread-{i}")
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()

Adding a lock in each of __getitem__, __delitem__, __setitem__ as you suggested does appear to solve the issue.
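
Roughly, that looks like the following wrapper (in-process only, so it won't protect separate processes sharing the cache file; the class name is just for illustration):

import threading

from botocore.utils import JSONFileCache


class LockingJSONFileCache(JSONFileCache):
    # Serialize all cache access within this process behind one lock.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._lock = threading.Lock()

    def __getitem__(self, cache_key):
        with self._lock:
            return super().__getitem__(cache_key)

    def __setitem__(self, cache_key, value):
        with self._lock:
            super().__setitem__(cache_key, value)

    def __delitem__(self, cache_key):
        with self._lock:
            super().__delitem__(cache_key)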

Laurent-Dx avatar Jun 27 '24 19:06 Laurent-Dx

Thanks for reaching out. The error you referenced was also reported here: https://github.com/boto/botocore/issues/3106. I was advised that an error message like this could help clarify the behavior here: https://github.com/boto/botocore/pull/3183/files. Does deleting the cache file fix this for you?

tim-finnigan avatar Jun 28 '24 17:06 tim-finnigan

Yeah, wrapping this in a retry in our app code does work (and is what I did in parallel to filing this bug). Apologies for not realizing it was previously reported. And thank you Laurent for the multi-threaded repro! (Our problematic setup is multi-process, which is why I didn't propose in-process locks.)
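
The retry amounts to something like this sketch (the function name is made up, and the exact exception that surfaces may be a KeyError from the cache rather than the JSONDecodeError, so it catches both):

import json
import time


def get_frozen_credentials_with_retry(credentials, attempts=3, delay=0.1):
    # Retry in case a concurrent writer left the cache file momentarily
    # empty or truncated.
    for attempt in range(attempts):
        try:
            return credentials.get_frozen_credentials()
        except (KeyError, json.JSONDecodeError):
            if attempt == attempts - 1:
                raise
            time.sleep(delay)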

mikelambert avatar Jun 29 '24 00:06 mikelambert

Thanks for following up and confirming. I'll go ahead and close this as a duplicate, and we will continue tracking the issue in https://github.com/boto/botocore/issues/3106.

tim-finnigan avatar Jul 02 '24 22:07 tim-finnigan

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Jul 02 '24 22:07 github-actions[bot]

I'm hitting this same issue. In my case I'm spawning 10 aws s3 cp processes concurrently from Go code using goroutines, and a few of them fail with the JSONDecodeError while the others succeed.

From what I can tell, this only happens when the cache file doesn't already exist, so I'm working around it by making a dummy AWS CLI call like aws sts get-caller-identity to force the cache file to be written to disk before I spawn the concurrent commands.
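
In Python terms the warm-up is essentially this (sketch only; the real workaround runs the CLI from Go):

import subprocess

# Make one credential-caching call up front so the cache file already exists
# before any concurrent commands start.
subprocess.run(
    ["aws", "sts", "get-caller-identity"],
    check=True,
    capture_output=True,
)
# ...now spawn the concurrent "aws s3 cp" processes...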

sfc-gh-abalik avatar Jul 31 '24 04:07 sfc-gh-abalik

This happens to me frequently (more than once a week) while using skypilot, which makes heavy use of the botocore library for the AWS API, often in parallel in multiple processes. It is not a temporary problem: the cache JSON becomes malformed and all future attempts crash until I manually fix or delete the malformed file.

The nature of the broken file is like this:

{"startUrl": "https://<redacted>.awsapps.com/start", "region": "us-west-2", "accessToken": "<redacted>", "expiresAt": "2024-12-12T23:02:38Z", "clientId": "<redacted>", "clientSecret": "<redacted>", "registrationExpiresAt": "2025-01-20T23:08:06Z", "refreshToken": "<redacted>"}"}

There are always one or two extra characters at the end of the file. I believe this is likely due to the race condition mentioned in the OP; see the sketch after this list:

  • process 1 calls f.truncate()
  • process 2 calls f.truncate()
  • process 1 calls f.write()
  • process 2 calls f.write() (contents are 1-2 chars shorter than what process 1 wrote)
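
That interleaving is easy to reproduce with two file handles standing in for the two processes (standalone sketch, not botocore code):

import os

path = "/tmp/race_demo.json"

# Two independent handles to the same file, as two processes would have.
f1 = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT), "w")
f2 = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT), "w")

f1.truncate()                     # process 1 truncates
f2.truncate()                     # process 2 truncates
f1.write('{"token": "abcdef"}')   # process 1 writes the longer payload
f1.close()
f2.write('{"token": "abcd"}')     # process 2 writes a shorter payload at offset 0
f2.close()

print(open(path).read())          # {"token": "abcd"}"}  <- leftover bytes from process 1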

cg505 avatar Dec 11 '24 22:12 cg505

Seeing the same thing; here's an example with identifiers removed:

cat ~/.aws/cli/cache/1bd....json
{"Credentials": {"AccessKeyId": "...", "SecretAccessKey": "...", "SessionToken": "...", "Expiration": "2025-04-29T04:27:05+00:00", "AccountId": ".."}, "AssumedRoleUser": {"AssumedRoleId": "..:botocore-session-1745897217", "Arn": "arn:aws:sts::..:assumed-role/../botocore-session-1745897217"}, "ResponseMetadata": {"RequestId": "64e74709-7e12-4f7f-98c8-c12e322989b8", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "64e74709-7e12-4f7f-98c8-c12e322989b8", "content-type": "text/xml", "content-length": "1512", "date": "Tue, 29 Apr 2025 03:27:05 GMT"}, "RetryAttempts": 1}}: true, "RetryAttempts": 2}}

You can see the issue near the end, where }}: true isn't valid JSON.

Probably needs something atomic, like a rename from a randomly generated name to the desired name, or some locking contraption.
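
A minimal sketch of the rename approach (helper name hypothetical; os.replace is atomic provided the temp file lives on the same filesystem as the destination):

import json
import os
import tempfile


def atomic_write_json(path, value):
    # Write to a temp file in the destination directory, then rename it over
    # the destination, so readers see either the old file or the new one.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(value, f)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise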

nijave avatar Apr 29 '25 17:04 nijave

I hate to say it, but at this rate I think the only way this will get fixed is if one of us writes the patch.

cg505 avatar Apr 29 '25 18:04 cg505

I'm not sure concurrency safety is really needed for CLI tools. A workaround like simply deleting the cache files would probably be sufficient. Either way, I've created a PR that might help with issues like this.

ranlz77 avatar Aug 24 '25 18:08 ranlz77

@ranlz77 the team has left a few comments to address on your PR. Please let us know whether you can make the changes. If you are unable to make the requested modifications, please let us know and our team will create a pull request with the necessary changes.

adev-code avatar Oct 02 '25 17:10 adev-code