botocore icon indicating copy to clipboard operation
botocore copied to clipboard

Credentials JSONFileCache explodes if cache file has malformed JSON

Open mildmojo opened this issue 1 year ago • 9 comments

Describe the bug

The JSONFileCache class __getitem__ accessor tries to load JSON from a file, but does not handle syntax errors in that file, raising an exception that bubbles up to the client application.

Given this is a cache, a failure to read from the cache should be treated as a cache miss, not an error.

Expected Behavior

In botocore/credentials.py, in the CachedCredentialFetcher class, I expected _load_from_cache() to return None if the cache was invalid, causing credentials to be re-fetched and the cache to be rebuilt.

Current Behavior

I found this while using awscli, which consumes botocore. When a credentials cache file was malformed, every command using that particular AWS account failed super cryptically, printing the bad file's cache key to STDERR:

$ aws ecr get-login-password --profile root

'c09eaf18178406053ca8802bc45deadbeefdc57a'

$ echo $?
255

Using a debugger, I located the failure at utils.py:3518, when botocore failed to parse a cache file.

Cache file contents (prettified here):

{
    "Credentials":
    {
        "AccessKeyId": "<redacted>",
        "SecretAccessKey": "<redacted>",
        "SessionToken": "<redacted>",
        "Expiration": "2024-01-24T02:49:26+00:00"
    },
    "AssumedRoleUser":
    {
        "AssumedRoleId": "<redacted>:botocore-session-1706060963",
        "Arn": "arn:aws:sts::<redacted>:assumed-role/Admin/botocore-session-1706060963"
    },
    "PackedPolicySize": 6,
    "ResponseMetadata":
    {
        "RequestId": "<redacted>",
        "HTTPStatusCode": 200,
        "HTTPHeaders":
        {
            "x-amzn-requestid": "<redacted>",
            "content-type": "text/xml",
            "content-length": "1569",
            "date": "Wed, 24 Jan 2024 01:49:26 GMT"
        },
        "RetryAttempts": 1
    }
}: true, "RetryAttempts": 2}}

(The bit at the bottom, : true, "RetryAttempts": 2}} invalidates the JSON.)

Reproduction Steps

Here's a minimal reproduction:

# json-cache-repro.py

from botocore.utils import JSONFileCache
from botocore.credentials import CachedCredentialFetcher
from os import remove

CACHE_KEY = 'foo-botocore-test'
CACHE_DIR = '/tmp'
CACHE_FILE = f'{CACHE_DIR}/{CACHE_KEY}.json'

class TestCredentialsFetcher(CachedCredentialFetcher):
    def _create_cache_key(_):
        return CACHE_KEY

with open(CACHE_FILE, 'w') as f:
    f.write('hello')

cache = JSONFileCache(working_dir=CACHE_DIR)

fetcher = TestCredentialsFetcher(cache=cache)

try:
    fetcher.fetch_credentials()
finally:
    remove(CACHE_FILE)

Example run:

$ python -m venv json-cache-repro
$ . json-cache-repro/bin/activate
(json-cache-repro) $ pip install botocore

(json-cache-repro) $ python3 json-cache-repro.py
Traceback (most recent call last):
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/utils.py", line 3518, in __getitem__
    return json.load(f)
           ^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/[email protected]/3.11.7_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/me/json-cache-repro.py", line 17, in <module>
    fetcher.fetch_credentials()
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 685, in fetch_credentials
    return self._get_cached_credentials()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 693, in _get_cached_credentials
    response = self._load_from_cache()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/credentials.py", line 711, in _load_from_cache
    creds = deepcopy(self._cache[self._cache_key])
                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/me/json-cache-repro/lib/python3.11/site-packages/botocore/utils.py", line 3520, in __getitem__
    raise KeyError(cache_key)
KeyError: 'foo-botocore-test'

Possible Solution

Workaround

Delete the affected cache file, or delete all files in the cache dir (e.g. ~/.aws/cli/cache/).

Fix

This is tricky. JSONFileCache doesn't know what it's storing, and CachedCredentialFetcher doesn't know what kind of cache it's using. CachedCredentialFetcher assumes that if a key exists in the cache, it's valid to read, which is incorrect for JSONFileCache.

Suggestion:

  1. In JSONFileCache#__getitem__, catch (OSError, ValueError, json.decoder.JSONDecodeError) and re-raise any of them as a new CacheMissError.
  2. In CachedCredentialFetcher#_load_from_cache, catch CacheMissError and return None. This will hopefully trigger a credential refresh that will overwrite the invalid cache file.

Triage

If the maintainers prefer to raise exceptions here (maybe there's a lot of extra work to change behavior), in the meantime, the exception message in JSONFileCache#__getitem__ could be improved to suggest manual cleanup:

try:
    with open(actual_key) as f:
        return json.load(f)
except (OSError, ValueError, json.decoder.JSONDecodeError):
    raise KeyError(f'Cache file {actual_key} may be corrupt, please delete it.')

Additional Information/Context

It took me most of a day to track this down when it started causing awscli errors. I initially thought it was an AWS permissions issue, or a problem with my local credentials, or corrupted awscli configuration, or my environment, or something a reboot might fix. None of my teammates could reproduce it.

I was disappointed to find out it was a cache error!

Thanks for your consideration.

SDK version used

v1.34.28 (also observed in v2.0.0dev155, installed as awscli v2.15.13 dependency via homebrew)

Environment details (OS name and version, etc.)

MacOS 14.2.1 (arm64)

mildmojo avatar Jan 25 '24 21:01 mildmojo

Thinking about it last night, I also believe the current code will never heal a corrupted cache file on its own. With the current strategy, a cache entry is only updated if it doesn't exist yet (cache miss) or the data can be parsed and evaluated for expiration. A cache file that can't be read or parsed will break the library forever unless there's manual intervention to clear the cache on disk, and that intervention is less likely because the exceptions raised by the cache class don't contain enough information to suggest a solution.

mildmojo avatar Jan 26 '24 16:01 mildmojo

Thanks for reaching out and for your patience here. I could reproduce the errors you referenced using your code snippet. I will plan to bring this up with the team for discussion.

tim-finnigan avatar May 20 '24 16:05 tim-finnigan

After discussing with the team, we think that there may be some opportunities to improve the behavior here. I created a PR with the error message you suggested for the team to consider: https://github.com/boto/botocore/pull/3183. Further investigation may still be needed in terms of when the error occurs in Botocore/the AWS CLI.

tim-finnigan avatar May 21 '24 18:05 tim-finnigan

It looks like #3213 was closed in favor of this one, but this one looks to previously be about a confusing error message (and proposed fix to the message).

Is it safe to assume this one is now also about fixing the race condition (potentially with a flock) to avoid corrupting the cache file at all? Let me know if it’d be more helpful to submit an untested PR with a proposed fix. Thank you!

mikelambert avatar Jul 03 '24 01:07 mikelambert

Thanks for following up, this issue is more focused on the error message so I can reopen https://github.com/boto/botocore/issues/3213 for further review/discussion. You can submit a PR for this but I can't provide any guarantees on if or when it might be considered. I think more research is needed here into what can cause JSONFileCache to break and how that may be addressed.

tim-finnigan avatar Jul 11 '24 19:07 tim-finnigan

I think the error message is a good triage, and it's definitely worth exploring corruption root causes. But long-term, I also think it's worth making cache failures non-fatal. Maybe a warning, if there's value in communicating that this optimization has failed.

mildmojo avatar Jul 11 '24 20:07 mildmojo

think its been done in it for the mean time values for it.

amberkushwaha avatar Jul 25 '24 05:07 amberkushwaha

This error still occurs in botocore v1.37.7 using the reproduction steps above.

It also still occurs in awscli v2.24.18, which hasn't updated from embedding botocore v2.0.0dev155.

Luckily I remembered the symptom of awscli printing an inscrutable hash to the console instead of an error message when I hit it again today. 😁

I still believe this performance optimization failure should be non-fatal, with a warning message.

mildmojo avatar Mar 06 '25 19:03 mildmojo

Thanks! Adding aws --debug flag actually logs the exception so I was able to locate this issue

...
  File "/opt/homebrew/Cellar/awscli/2.27.3/libexec/lib/python3.13/site-packages/awscli/botocore/utils.py", line 3905, in __getitem__
    raise KeyError(cache_key)
KeyError: '1bd7289fd7217f02d4c4f9b67c21b6d938f9cd58'

'1bd7289fd7217f02d4c4f9b67c21b6d938f9cd58'

nijave avatar Apr 29 '25 17:04 nijave