cachecontrol icon indicating copy to clipboard operation
cachecontrol copied to clipboard

Cache Deserialization (part two) -- warning message incorrect?

Open kratsg opened this issue 6 years ago • 4 comments

This is related to #148 - but I would prefer not to necro. I'm seeing this repeatedly when I make the same request over and over again. I'll detail everything I'm using right now.

(stare) $ python --version
Python 3.7.3
(stare) $ pip --version
pip 19.2.3 from /Users/kratsg/.virtualenvs/stare/lib/python3.7/site-packages/pip (python 3.7)
Package             Version   Location               
------------------- --------- -----------------------
appdirs             1.4.3     
aspy.yaml           1.3.0     
atomicwrites        1.3.0     
attrs               19.1.0    
bandit              1.6.2     
betamax             0.8.1     
betamax-serializers 0.2.1     
black               19.3b0    
bleach              3.1.0     
bumpversion         0.5.3     
CacheControl        0.12.5    
certifi             2019.6.16 
cfgv                2.0.1     
chardet             3.0.4     
Click               7.0       
coverage            4.5.4     
docutils            0.15.2    
ecdsa               0.13.2    
future              0.17.1    
gitdb2              2.0.5     
GitPython           3.0.2     
identify            1.4.6     
idna                2.8       
importlib-metadata  0.19      
lockfile            0.12.2    
more-itertools      7.2.0     
msgpack             0.6.1     
nodeenv             1.3.3     
packaging           19.1      
pbr                 5.4.2     
pip                 19.2.3    
pkginfo             1.5.0.1   
pluggy              0.12.0    
pre-commit          1.18.2    
py                  1.8.0     
pyasn1              0.4.6     
pyflakes            2.1.1     
Pygments            2.4.2     
pyparsing           2.4.2     
pytest              5.1.1     
pytest-cov          2.7.1     
pytest-mock         1.10.4    
python-dotenv       0.10.3    
python-jose         3.0.1     
PyYAML              5.1.2     
readme-renderer     24.0      
requests            2.22.0    
requests-toolbelt   0.9.1     
rsa                 4.0       
setuptools          41.2.0    
six                 1.12.0    
smmap2              2.0.5     
stare               0.0.2     /Users/kratsg/stare/src
stevedore           1.30.1    
toml                0.10.0    
tqdm                4.35.0    
twine               1.13.0    
urllib3             1.25.3    
virtualenv          16.7.4    
wcwidth             0.1.7     
webencodings        0.5.1     
wheel               0.33.6    
zipp                0.6.0     

Whenever I make the same request over and over (no warning the first time when there's no cache) but on subsequent requests, I get a warning...

cache entry deserialization failed, entry ignored (controller.py:147)

which indicates this line here: https://github.com/ionrock/cachecontrol/blob/0234b80/cachecontrol/controller.py#L147

So I'm going manually through the steps to find where the crash/error is. I see that we're using v4 so msgpack is being used:

>>> from cachecontrol.serialize import Serializer
>>> s = Serializer()
>>> data = open('.webcache/6/0/a/6/d/60a6d7942f39f888d6dc7b3a89964aea0a63c8f708eef2bdefd601be', 'rb').read()
>>> s.loads('', data)

No error there, so digging in more (given that there's no error on the empty request, I suspect it is not related to the cache but to the request itself...). So reverse-engineering what serialize.py is doing:

>>> ver, data = data.split(b",", 1)
>>> ver
b'cc=4'
>>> ver = ver.split(b"=", 1)[-1].decode("ascii")
>>> ver
'4'
>>> cached = msgpack.loads(data, raw=False)
>>> cached
{'response': {'body': b'<body suppressed for now>', 'headers': {'Date': 'Tue, 27 Aug 2019 14:51:20 GMT', 'Server': 'Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9 PHP/5.4.16', 'Vary': 'Authorization,Accept-Encoding,User-Agent', 'X-Powered-By': 'PHP/5.4.16', 'Set-Cookie': 'PHPSESSID=<COOKIE>; path=/', 'Expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'Cache-Control': 'private', 'Pragma': 'no-cache', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Headers': 'API-KEY, Authorization', 'Access-Control-Allow-Methods': 'GET, POST, PUT, DELETE', 'Content-Encoding': 'gzip', 'Content-Length': '45806', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'application/json'}, 'status': 200, 'version': 11, 'reason': 'OK', 'strict': 0, 'decode_content': False}, 'vary': {'Authorization': 'Bearer <BEARER TOKEN>', 'Accept-Encoding': 'gzip, deflate', 'User-Agent': 'python-requests/2.22.0'}}

So now I prep a request to compare against to see how cachecontrol works in the prepare_response section

To do this part, I just insert pdb into the relevant section and just evaluate specific lines, but I found out that the problem is that the headers didn't match (and in fact, I'm making different requests since the Bearer Token changed between requests when I submit repeatedly since I'm not caching the access tokens).

https://github.com/ionrock/cachecontrol/blob/0234b80/cachecontrol/serialize.py#L115-L119

So what ends up happening, to me, is that the warning message being spit out is not very clear at all. In fact, it just (in this case) means that the headers didn't match -- so the cache should be rebuilt. That is cache deserialization did not fail.

kratsg avatar Aug 27 '19 15:08 kratsg

I've seen this too; the simplest way to repro is to send a request to an endpoint that serves a Vary: Accept response header, and then send a different Accept request header the second time. It looks to me that the None signals both a corrupt or unacceptable cached resource, and the simplest thing to do is just remove the warning message.

import logging
logging.basicConfig(level=logging.WARNING)

from cachecontrol import CacheControlAdapter
from cachecontrol.caches import SeparateBodyFileCache
from requests.adapters import HTTPAdapter
from requests.sessions import Session

cache_dir = "foo"
sess = Session()
cache_adapter = CacheControlAdapter(
    cache=SeparateBodyFileCache(cache_dir)
)
sess.mount("https://", cache_adapter)

sess.get("https://pypi.org/simple/cachecontrol", headers={"Accept": "text/html"})
sess.get("https://pypi.org/simple/cachecontrol", headers={"Accept": "application/vnd.pypi.simple.v1+json"})

thatch avatar Feb 15 '24 03:02 thatch

@thatch Is your solution just requesting same resource twice with different Accept? I don't understand, can you elaborate?

dvershinin avatar Feb 15 '24 12:02 dvershinin

@dvershinin sorry I wasn't clear -- that's just a simple repro, against a well-behaved server, showing the server using Vary: Accept is enough to trigger this message.

thatch avatar Feb 15 '24 16:02 thatch

My understanding based on @thatch's reproduction, is that there are perhaps certain keys or fields in the header that do not necessarily change the response from the server (for example the Accept could change the response, and so Vary: Accept is used to tell the cache that you need to also cache based on the value of the header's Accept). My guess is that cachecontrol is not ignoring most of the header fields (and not relying on Vary which admittedly is probably not as well-known?).

Particularly @thatch is expecting two separate caches to be made, rather than trying to (re)use the text/html cache for the json one.

kratsg avatar Feb 15 '24 22:02 kratsg