Race condition in botocore's use of JSONFileCache for credential loading
Describe the bug
I am getting a rare crash when using botocore to load cached credential files.
It's a JSONDecodeError: Expecting value: line 1 column 1 (char 0) stemming from JSONFileCache.__getitem__, which implies the JSON file is empty (no value found at char 0). Someone pointed out that it might be a race condition in JSONFileCache.__setitem__, which does:
f.truncate()
f.write(file_content)
We have multiple concurrent processes starting on a box, each using botocore, so maybe this is just a race condition where one of them reads the file after the truncate but before the write? I'm not sure whether a flock, a write-then-rename, or something else would be the proper solution here.
Expected Behavior
It shouldn't crash
Current Behavior
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
File "botocore/utils.py", line 3518, in __getitem__
return json.load(f)
File "__init__.py", line 293, in load
return loads(fp.read(),
File "__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
File "botocore/client.py", line 553, in _api_call
return self._make_api_call(operation_name, kwargs)
File "botocore/client.py", line 989, in _make_api_call
http, parsed_response = self._make_request(
File "botocore/client.py", line 1015, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "botocore/endpoint.py", line 119, in make_request
return self._send_request(request_dict, operation_model)
File "botocore/endpoint.py", line 198, in _send_request
request = self.create_request(request_dict, operation_model)
File "botocore/endpoint.py", line 134, in create_request
self._event_emitter.emit(
File "botocore/hooks.py", line 412, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "botocore/hooks.py", line 256, in emit
return self._emit(event_name, kwargs)
File "botocore/hooks.py", line 239, in _emit
response = handler(**kwargs)
File "botocore/signers.py", line 105, in handler
return self.sign(operation_name, request)
File "botocore/signers.py", line 186, in sign
auth = self.get_auth_instance(**kwargs)
File "botocore/signers.py", line 301, in get_auth_instance
frozen_credentials = credentials.get_frozen_credentials()
File "botocore/credentials.py", line 634, in get_frozen_credentials
self._refresh()
File "botocore/credentials.py", line 522, in _refresh
self._protected_refresh(is_mandatory=is_mandatory_refresh)
File "botocore/credentials.py", line 538, in _protected_refresh
metadata = self._refresh_using()
File "botocore/credentials.py", line 685, in fetch_credentials
return self._get_cached_credentials()
File "botocore/credentials.py", line 693, in _get_cached_credentials
response = self._load_from_cache()
File "botocore/credentials.py", line 711, in _load_from_cache
creds = deepcopy(self._cache[self._cache_key])
File "botocore/utils.py", line 3520, in __getitem__
raise KeyError(cache_key)
Reproduction Steps
Hard to reproduce, and I haven't tried myself. I assume thousands of processes recreating the botocore cached credentials file would do it.
Possible Solution
Perhaps a flock, or writing to a temp file and then renaming it to the destination path, instead of truncate-then-write?
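For illustration, a minimal sketch of the write-to-temp-then-rename idea (not botocore's actual implementation; atomic_json_dump is a hypothetical helper, and it assumes the temp file and destination are on the same filesystem so os.replace() is atomic):
import json
import os
import tempfile

def atomic_json_dump(obj, dest_path):
    # Write the full payload to a uniquely named temp file in the same
    # directory as the destination...
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # ...then atomically swap it into place. Readers see either the old
        # complete file or the new complete file, never a partial write.
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)
        raise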
Additional Information/Context
No response
SDK version used
1.34.42
Environment details (OS name and version, etc.)
AWS instance running x86_64 GNU/Linux
The following does reproduce the issue instantly:
from threading import Thread
from botocore.utils import JSONFileCache

cache = JSONFileCache()

def f():
    for i in range(100000):
        cache["key"] = 10
        cache["key"]

threads = []
for i in range(2):
    thread = Thread(target=f, name=f"Thread-{i}")
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
Adding a lock in each of __getitem__, __delitem__, __setitem__ as you suggested does appear to solve the issue.
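For reference, a minimal sketch of that lock-wrapped cache (a hypothetical subclass, not an actual botocore change; it only serializes access within a single process, so it would not help a multi-process setup):
import threading

from botocore.utils import JSONFileCache

class LockedJSONFileCache(JSONFileCache):
    # Hypothetical wrapper: serialize all cache access behind one lock.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._lock = threading.Lock()

    def __getitem__(self, cache_key):
        with self._lock:
            return super().__getitem__(cache_key)

    def __setitem__(self, cache_key, value):
        with self._lock:
            super().__setitem__(cache_key, value)

    def __delitem__(self, cache_key):
        with self._lock:
            super().__delitem__(cache_key)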
Thanks for reaching out. The error you referenced was also reported here: https://github.com/boto/botocore/issues/3106. I was advised that an error message like this could help clarify the behavior here: https://github.com/boto/botocore/pull/3183/files. Does deleting the cache file fix this for you?
Yeah wrapping this in a retry in our app code does work (and is what I did in parallel to filing this bug). Apologies for not realizing it was previously reported. And thank you Laurent for the multi-threaded repro! (Our problematic setup is multi-process, which is why I didn't propose in-process locks)
Thanks for following up and confirming. I'll go ahead and close this as a duplicate, and we'll continue tracking the issue in https://github.com/boto/botocore/issues/3106.
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.
I'm hitting this same issue. In my case I'm spawning 10 aws s3 cp processes concurrently from Go code using goroutines, and a few of them fail with the JSONDecodeError while the others succeed.
From what I can tell this only happens when the cache file doesn't already exist, so I'm working around it by making a dummy AWS CLI call such as aws sts get-caller-identity to force the cache file to be written to disk before I spawn the concurrent commands.
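For anyone doing the same from Python, a rough sketch of that pre-warm pattern (the file and bucket names are placeholders):
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Warm the credential cache once so the cache file exists on disk before
# any concurrent CLI processes race to create it.
subprocess.run(["aws", "sts", "get-caller-identity"],
               check=True, capture_output=True)

def copy(i):
    # Each call spawns its own `aws s3 cp` process; paths are illustrative.
    subprocess.run(["aws", "s3", "cp", f"./file-{i}", "s3://example-bucket/"],
                   check=True)

with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(copy, range(10)))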
This happens to me frequently (more than once a week) while using skypilot, which makes heavy use of the botocore library for the AWS API, often in parallel across multiple processes. It is not a transient problem: the cache JSON becomes malformed and all future attempts crash until I manually fix or delete the malformed JSON file.
The nature of the broken file is like this:
{"startUrl": "https://<redacted>.awsapps.com/start", "region": "us-west-2", "accessToken": "<redacted>", "expiresAt": "2024-12-12T23:02:38Z", "clientId": "<redacted>", "clientSecret": "<redacted>", "registrationExpiresAt": "2025-01-20T23:08:06Z", "refreshToken": "<redacted>"}"}
There are always one or two extra characters at the end of the file. I believe this is likely due to the race condition mentioned in the OP (a toy reproduction of this interleaving follows the list):
- process 1 calls f.truncate()
- process 2 calls f.truncate()
- process 1 calls f.write()
- process 2 calls f.write() (its contents are 1-2 chars shorter than what process 1 wrote, so process 1's trailing characters are left behind)
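A standalone toy (not botocore's actual file handling; the file path and payloads are made up) that reproduces this interleaving and the resulting trailing characters:
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache.json")

f1 = open(path, "w")  # process 1 opens; open("w") truncates, like f.truncate()
f2 = open(path, "w")  # process 2 opens and truncates before process 1 writes

f1.write('{"token": "abcdef"}')  # longer content
f1.close()                       # flushed at offset 0
f2.write('{"token": "abc"}')     # shorter content, also flushed at offset 0
f2.close()                       # overwrites only the first 16 bytes

print(open(path).read())  # -> {"token": "abc"}f"}  (process 1's tail remains)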
Seeing the same thing. Here's an example with identifiers removed:
cat ~/.aws/cli/cache/1bd....json
{"Credentials": {"AccessKeyId": "...", "SecretAccessKey": "...", "SessionToken": "...", "Expiration": "2025-04-29T04:27:05+00:00", "AccountId": ".."}, "AssumedRoleUser": {"AssumedRoleId": "..:botocore-session-1745897217", "Arn": "arn:aws:sts::..:assumed-role/../botocore-session-1745897217"}, "ResponseMetadata": {"RequestId": "64e74709-7e12-4f7f-98c8-c12e322989b8", "HTTPStatusCode": 200, "HTTPHeaders": {"x-amzn-requestid": "64e74709-7e12-4f7f-98c8-c12e322989b8", "content-type": "text/xml", "content-length": "1512", "date": "Tue, 29 Apr 2025 03:27:05 GMT"}, "RetryAttempts": 1}}: true, "RetryAttempts": 2}}
You can see the issue near the end (}}: true), which isn't valid JSON.
This probably needs something atomic, like a rename from a randomly generated name to the desired name, or some locking contraption.
I hate to say it, but at this rate I think the only way this will get fixed is if one of us writes the patch.
I'm not sure concurrency safety is really needed for CLI tools. A workaround like simply deleting the cache files would probably be sufficient. Either way, I've created a PR that might help with issues like this.
@ranlz77 the team has left a few comments to be addressed on your PR. Please let us know whether you can make the changes. If you are unable to make the requested modifications, our team will create a pull request with the necessary changes.