cachecontrol
cachecontrol copied to clipboard
Not working with 'requests' module
I'm using python 3.6 with requests
module for API consumption and CacheControl
module for caching the API response. I'm using following code but cache does not seems to be working:
import requests
from cachecontrol import CacheControl
sess = requests.session()
cached_sess = CacheControl(sess)
response = cached_sess.get('https://jsonplaceholder.typicode.com/users')
Every request to this URL returns the 200
status code (instead of 304
status code) and the same resource is requested each time even though the ETag
headers is same and max-age
was still valid. The API returns following cache related headers:
'Cache-Control': 'public, max-age=14400'
'Expires': 'Sat, 04 Feb 2017 22:23:28 GMT' (time of original request)
'Etag': 'W/"160d-MxiAGkI3ZBrjm0xiEDfwqw"'
What could be the issue here?
UPDATE: I'm not sending If-None-Match
header with any API call, do I manually have to do it or CacheControl
module should take care of it automatically?
'Expires': 'Sat, 04 Feb 2017 22:23:28 GMT' (time of original request)
Did you mean to say that the server is sending an Expires header that is basically datetime.now()
? If so, I suspect that the server telling CacheControl that something expires immediately would be the cause.
Yes, the Expires
header value is set to the server time at which the resource is requested. But if 'Cache-Control' is allowing a max-age
, should not it ignore the Expires
header?
If both are sent, an HTTP/1.1 implementation SHOULD obey the Cache-control: max-age=N directive, and so SHOULD ignore the Expires field value.
https://www.w3.org/Protocols/HTTP/Issues/expire-cache.html
It's a public API for testing purpose, you can check the live response through any API client.
I'm seeing the same issue.
First request response headers Cache-Control: max-age=604800, private 'Expires': 'Thu, 11 May 2017 15:35:34 GMT' (this is 5 minutes in the future) 'ETag': 'foo' response.from_cache: False
Subsequent request is immediately after. For the subsequent request, im adding the header "If-None-Match: foo". I don't see this being added by cachecontrol. response.from_cache: True http response 200 (not 304)
>>> from datetime import datetime
>>> from requests import Session
>>> from cachecontrol import CacheControl
>>>
>>> session = CacheControl(Session())
>>> datetime.strftime(datetime.now(), "%Y-%m-%d %H:%M:%S")
'2018-09-27 16:49:49'
>>> response = session.get("https://<url i'm hitting with caching headers>")
>>> response.headers['Cache-Control'], response.headers['Expires']
('max-age=5054345, must-revalidate', 'Sun, 25 Nov 2018 08:48:54 GMT')
>>> response.from_cache
False
>>> response.status_code
200
I seem to be running into this issue as well unless I am misunderstanding the purpose of from_cache. I would think that if i cache this response, from_cache would return True if the response is not yet expired. I wasn't able to see from the documentation what method should be followed to check if a cached item has expired aside from asserting from_cache.
@robgil if I understand that response correctly, the 200 is from the original response. The server responds with a 304, which tells CacheControl to use the response it has, which is the 200.
@ptdel Based on that session in the Python repl, you'd have to make another request to get a from_cache
value of True
. The from_cache
means the response came from the cache store, but on the first request, it wouldn't have been cached yet, so it wouldn't be set to True
.
@ionrock thank you for clearing that up for me. If I want to test that a stored response has expired or not what is the proper method to use for this, or am I better off using a time delta? I apologize if I'm getting into off-topic territory to the issue
@ptdel Based on that session in the Python repl, you'd have to make another request to get a from_cache value of True. The from_cache means the response came from the cache store, but on the first request, it wouldn't have been cached yet, so it wouldn't be set to True.
This is not my experience. From the example in the usage, it seems the from_cache
parameter is never set to True:
In [1]: import requests
...: import cachecontrol
...:
...: sess = cachecontrol.CacheControl(requests.Session())
...: resp = sess.get('http://google.com')
...:
In [3]: resp.status_code
Out[3]: 200
In [4]: resp.from_cache
Out[4]: False
In [5]: resp = sess.get('http://google.com')
In [6]: resp.from_cache
Out[6]: False
Then I figured, maybe it's google.com that's not sending the right headers! That's highly doubtful, but just in case, here's one case I know works reliably, by design, at least in my web browser (Firefox):
In [7]: import requests
...: import cachecontrol
...:
...: sess = cachecontrol.CacheControl(requests.Session())
...: resp = sess.get('http://httpbin.org/cache')
...:
In [8]: resp.from_cache
Out[8]: False
In [9]: resp = sess.get('http://httpbin.org/cache')
In [10]: resp.from_cache
Out[10]: False
In [11]:
@ionrock - i do believe there's a problem here. I haven't been able to diagnose it just yet, but it certainly seems from_cache
doesn't get set properly.
and actually, it seems the response doesn't get cached at all. i.e. this fails at the first assert:
import requests
import cachecontrol
def main():
sess = requests.Session()
adapter = cachecontrol.CacheControlAdapter()
sess.mount('http://', adapter)
sess.mount('https://', adapter)
resp = sess.get('http://httpbin.org/cache')
assert adapter.cache.data
assert not resp.from_cache
resp = sess.get('http://httpbin.org/cache')
assert resp.from_cache
if __name__ == '__main__':
main()
when running in a debugger, adapter.cache.data
is empty ({}
).
something is clearly wrong here, then... even the built-in commandline tool fails with those URLs:
anarcat@angela:cachecontrol$ python3 _cmd.py http://httpbin.org/cache
Updating cache with response from "http://httpbin.org/cache"
Looking up "http://httpbin.org/cache" in the cache
No cache entry available
Not cached :(
anarcat@angela:cachecontrol$ python3 _cmd.py https://google.com/
Updating cache with response from "https://www.google.com/"
Looking up "https://www.google.com/" in the cache
No cache entry available
Not cached :(
from what i can tell, cache.set
is called only from cache_response
and update_cached_response
, and that, in turn, is only called from CacheControlAdapter.build_response()
which, in turn, is only called if CacheController.cached_request()
is true.
but cache_request is true only if there was already an item in the cache.
so, in other words, cache entries are added only if a cache entry already exists, so cache entries are never added? did I miss something?
but maybe I did and the the CacheControlAdapter.build_response()
does get called. even if it would be (and it isn't, I checked), there's still something fishy going on here:
https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/adapter.py#L73-L117
or, inline and with irrelevant parts removed:
# apply any expiration heuristics
if response.status == 304:
# [...]
cached_response = self.controller.update_cached_response(
request, response
)
# [...]
# We always cache the 301 responses
elif response.status == 301:
self.controller.cache_response(request, response)
else:
# Wrap the response file with a wrapper that will cache the
# response when the stream has been consumed.
response._fp = CallbackFileWrapper(
response._fp,
functools.partial(
self.controller.cache_response, request, response
),
)
if response.chunked:
super_update_chunk_length = response._update_chunk_length
def _update_chunk_length(self):
super_update_chunk_length()
if self.chunk_left == 0:
self._fp._close()
response._update_chunk_length = types.MethodType(
_update_chunk_length, response
)
If we have a 200 status code (as I do in my prototype), we're in the latter case, I believe, which means we only trigger the cache update after the _fp
stream gets read. Yet that stream is a private member that never gets used by requests and even in cachecontrol it's called "dead code" somewhere else:
class Serializer(object):
def dumps(self, request, response, body=None):
response_headers = CaseInsensitiveDict(response.headers)
if body is None:
body = response.read(decode_content=False)
# NOTE: 99% sure this is dead code. I'm only leaving it
# here b/c I don't have a test yet to prove
# it. Basically, before using
# `cachecontrol.filewrapper.CallbackFileWrapper`,
# this made an effort to reset the file handle. The
# `CallbackFileWrapper` short circuits this code by
# setting the body as the content is consumed, the
# result being a `body` argument is *always* passed
# into cache_response, and in turn,
# `Serializer.dump`.
response._fp = io.BytesIO(body)
so maybe this worked at some point, when requests had an internal _fp
object. it doesn't now, so that stuff never gets called, and entries never get cached here.
so this is definitely broken for me.
Only in the darkness can you see the stars. - Martin Luther King, Jr.
apparently, I read the code wrong. as it turns out, http://httpbin.org/cache gets cached by my browser, but not this module. that's a separate problem from "this module doesn't work at all". :) and in fact, the test suite here uses http://httpbin.org/cache/60 and that actually works fine.
maybe we're not caching aggressively enough, but that's a separate problem from "nothing works" which is what I thought was going on.
so: sorry for the noise, this actually works fine here and pretty well at that too!
thanks for this great module.
Closing as unactionable/no action needed at the moment; please let me know if that's wrong and I'll reopen!