cachecontrol Not working with 'requests' module

I'm using Python 3.6 with the requests module to consume an API and the CacheControl module to cache the API responses. I'm using the following code, but the cache does not seem to be working:

import requests
from cachecontrol import CacheControl

sess = requests.session()
cached_sess = CacheControl(sess)

response = cached_sess.get('https://jsonplaceholder.typicode.com/users')

Every request to this URL returns a 200 status code (instead of 304), and the same resource is requested each time even though the ETag header is the same and max-age is still valid. The API returns the following cache-related headers:

'Cache-Control': 'public, max-age=14400'
'Expires': 'Sat, 04 Feb 2017 22:23:28 GMT' (time of original request)
'Etag': 'W/"160d-MxiAGkI3ZBrjm0xiEDfwqw"'

What could be the issue here?

UPDATE: I'm not sending an If-None-Match header with any API call. Do I have to add it manually, or should the CacheControl module take care of it automatically?

Feb 04 '17 18:02 imfaisalkh

'Expires': 'Sat, 04 Feb 2017 22:23:28 GMT' (time of original request)

Did you mean to say that the server is sending an Expires header that is basically datetime.now()? If so, I suspect that the server telling CacheControl that something expires immediately would be the cause.

Feb 04 '17 18:02 sigmavirus24

Yes, the Expires header value is set to the server time at which the resource is requested. But if 'Cache-Control' specifies a max-age, shouldn't the client ignore the Expires header?

If both are sent, an HTTP/1.1 implementation SHOULD obey the Cache-control: max-age=N directive, and so SHOULD ignore the Expires field value.

https://www.w3.org/Protocols/HTTP/Issues/expire-cache.html
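
To make the quoted rule concrete, here is a minimal sketch of how a client could compute a freshness lifetime so that max-age wins over Expires. It is an illustration only, not CacheControl's actual implementation; the function name freshness_lifetime and its now parameter are made up for this example.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers, now=None):
    """Return a freshness lifetime in seconds (hypothetical helper).

    max-age takes precedence; Expires is consulted only when max-age is absent.
    """
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""), re.IGNORECASE)
    if match:
        # max-age is present, so Expires is ignored entirely.
        return int(match.group(1))

    expires = headers.get("Expires")
    if expires:
        now = now or datetime.now(timezone.utc)
        remaining = parsedate_to_datetime(expires) - now
        # An Expires value in the past means "already stale", not negative freshness.
        return max(0, int(remaining.total_seconds()))

    # No explicit freshness information at all.
    return 0


# Headers as reported in the original question.
headers = {
    "Cache-Control": "public, max-age=14400",
    "Expires": "Sat, 04 Feb 2017 22:23:28 GMT",
}
print(freshness_lifetime(headers))
```

Under this rule, an Expires value equal to the request time would not make the response stale on its own as long as max-age is also sent.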

It's a public API for testing purposes; you can check the live response with any API client.

Feb 04 '17 19:02 imfaisalkh

I'm seeing the same issue.

First request, response headers:

Cache-Control: max-age=604800, private
'Expires': 'Thu, 11 May 2017 15:35:34 GMT' (this is 5 minutes in the future)
'ETag': 'foo'

response.from_cache: False

The subsequent request is made immediately afterwards. For that request I'm adding the header "If-None-Match: foo" myself; I don't see cachecontrol adding it.

response.from_cache: True
HTTP response: 200 (not 304)

May 11 '17 15:05 robgil

>>> from datetime import datetime
>>> from requests import Session
>>> from cachecontrol import CacheControl
>>>
>>> session = CacheControl(Session())
>>> datetime.strftime(datetime.now(), "%Y-%m-%d %H:%M:%S")
'2018-09-27 16:49:49'
>>> response = session.get("https://<url i'm hitting with caching headers>")
>>> response.headers['Cache-Control'], response.headers['Expires']
('max-age=5054345, must-revalidate', 'Sun, 25 Nov 2018 08:48:54 GMT')
>>> response.from_cache
False
>>> response.status_code
200

I seem to be running into this issue as well, unless I am misunderstanding the purpose of from_cache. I would expect that if this response is cached, from_cache would be True as long as the response has not yet expired. I couldn't tell from the documentation how to check whether a cached item has expired, other than by asserting on from_cache.

Sep 27 '18 20:09 ptdel

@robgil if I understand that response correctly, the 200 is from the original response. The server responds with a 304, which tells CacheControl to use the response it has, which is the 200.

@ptdel Based on that session in the Python repl, you'd have to make another request to get a from_cache value of True. The from_cache means the response came from the cache store, but on the first request, it wouldn't have been cached yet, so it wouldn't be set to True.
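
The two points above (a 304 leads to the stored 200 being returned, and from_cache only becomes True on a later request) can be sketched with a toy model. This is not how CacheControl is implemented; Response, ToyRevalidatingClient and fake_origin are hypothetical names, and the "server" is just a function. The toy also revalidates on every request, whereas a response that is still fresh under max-age would normally be served from the cache without contacting the server at all.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Response:
    status_code: int
    body: bytes
    etag: str
    from_cache: bool = False


class ToyRevalidatingClient:
    """Toy model: `fetch` is a hypothetical callable standing in for the network."""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(url, headers) -> (status, body, etag)
        self._store = {}     # url -> Response stored from an earlier request

    def get(self, url):
        cached = self._store.get(url)
        # Only a previously stored response gives us an ETag to send.
        headers = {"If-None-Match": cached.etag} if cached is not None else {}
        status, body, etag = self._fetch(url, headers)
        if status == 304 and cached is not None:
            # The server confirmed the stored copy, so the caller gets the stored 200.
            return replace(cached, from_cache=True)
        fresh = Response(status, body, etag)
        self._store[url] = fresh
        return fresh


def fake_origin(url, headers):
    """Hypothetical origin that always serves the same entity, tagged "v1"."""
    if headers.get("If-None-Match") == '"v1"':
        return 304, b"", '"v1"'
    return 200, b'{"users": []}', '"v1"'


client = ToyRevalidatingClient(fake_origin)
first = client.get("https://example.invalid/users")
second = client.get("https://example.invalid/users")
print(first.status_code, first.from_cache)
print(second.status_code, second.from_cache)
```

In this model the caller never sees the 304 itself: both calls report status 200, and only from_cache distinguishes them.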

Sep 27 '18 21:09 ionrock

@ionrock thank you for clearing that up for me. If I want to test whether a stored response has expired, what is the proper method to use, or am I better off using a time delta? I apologize if I'm drifting off-topic for this issue.

Sep 27 '18 21:09 ptdel

@ptdel Based on that session in the Python repl, you'd have to make another request to get a from_cache value of True. The from_cache means the response came from the cache store, but on the first request, it wouldn't have been cached yet, so it wouldn't be set to True.

This is not my experience. Following the example in the usage docs, it seems the from_cache attribute is never set to True:

In [1]: import requests
   ...: import cachecontrol
   ...: 
   ...: sess = cachecontrol.CacheControl(requests.Session())
   ...: resp = sess.get('http://google.com')
   ...: 

In [3]: resp.status_code
Out[3]: 200

In [4]: resp.from_cache
Out[4]: False

In [5]: resp = sess.get('http://google.com')

In [6]: resp.from_cache
Out[6]: False

Then I figured, maybe it's google.com that's not sending the right headers! That's highly doubtful, but just in case, here's one case I know works reliably, by design, at least in my web browser (Firefox):

In [7]: import requests
   ...: import cachecontrol
   ...: 
   ...: sess = cachecontrol.CacheControl(requests.Session())
   ...: resp = sess.get('http://httpbin.org/cache')
   ...: 

In [8]: resp.from_cache
Out[8]: False

In [9]: resp = sess.get('http://httpbin.org/cache')

In [10]: resp.from_cache
Out[10]: False

In [11]:

@ionrock - i do believe there's a problem here. I haven't been able to diagnose it just yet, but it certainly seems from_cache doesn't get set properly.

Dec 07 '19 02:12 anarcat

and actually, it seems the response doesn't get cached at all. i.e. this fails at the first assert:

import requests
import cachecontrol


def main():
    sess = requests.Session()
    adapter = cachecontrol.CacheControlAdapter()
    sess.mount('http://', adapter)
    sess.mount('https://', adapter)
    resp = sess.get('http://httpbin.org/cache')
    assert adapter.cache.data
    assert not resp.from_cache
    resp = sess.get('http://httpbin.org/cache')
    assert resp.from_cache


if __name__ == '__main__':
    main()

when running in a debugger, adapter.cache.data is empty ({}).

something is clearly wrong here, then... even the built-in commandline tool fails with those URLs:

anarcat@angela:cachecontrol$ python3 _cmd.py http://httpbin.org/cache
Updating cache with response from "http://httpbin.org/cache"
Looking up "http://httpbin.org/cache" in the cache
No cache entry available
Not cached :(
anarcat@angela:cachecontrol$ python3 _cmd.py https://google.com/
Updating cache with response from "https://www.google.com/"
Looking up "https://www.google.com/" in the cache
No cache entry available
Not cached :(

from what i can tell, cache.set is called only from cache_response and update_cached_response, and that, in turn, is only called from CacheControlAdapter.build_response() which, in turn, is only called if CacheController.cached_request() is true.

but cached_request() is true only if there was already an item in the cache.

so, in other words, cache entries are added only if a cache entry already exists, so cache entries are never added? did I miss something?

but maybe I did, and CacheControlAdapter.build_response() does get called. even if it were (and it isn't, I checked), there's still something fishy going on here:

https://github.com/ionrock/cachecontrol/blob/master/cachecontrol/adapter.py#L73-L117

or, inline and with irrelevant parts removed:

            # apply any expiration heuristics
            if response.status == 304:
                # [...]
                cached_response = self.controller.update_cached_response(
                    request, response
                )
                # [...]
            # We always cache the 301 responses
            elif response.status == 301:
                self.controller.cache_response(request, response)
            else:
                # Wrap the response file with a wrapper that will cache the
                #   response when the stream has been consumed.
                response._fp = CallbackFileWrapper(
                    response._fp,
                    functools.partial(
                        self.controller.cache_response, request, response
                    ),
                )
                if response.chunked:
                    super_update_chunk_length = response._update_chunk_length

                    def _update_chunk_length(self):
                        super_update_chunk_length()
                        if self.chunk_left == 0:
                            self._fp._close()

                    response._update_chunk_length = types.MethodType(
                        _update_chunk_length, response
                    )

If we have a 200 status code (as I do in my prototype), we're in the latter case, I believe, which means we only trigger the cache update after the _fp stream gets read. Yet that stream is a private member that never gets used by requests and even in cachecontrol it's called "dead code" somewhere else:

class Serializer(object):

    def dumps(self, request, response, body=None):
        response_headers = CaseInsensitiveDict(response.headers)

        if body is None:
            body = response.read(decode_content=False)

            # NOTE: 99% sure this is dead code. I'm only leaving it
            #       here b/c I don't have a test yet to prove
            #       it. Basically, before using
            #       `cachecontrol.filewrapper.CallbackFileWrapper`,
            #       this made an effort to reset the file handle. The
            #       `CallbackFileWrapper` short circuits this code by
            #       setting the body as the content is consumed, the
            #       result being a `body` argument is *always* passed
            #       into cache_response, and in turn,
            #       `Serializer.dump`.
            response._fp = io.BytesIO(body)

so maybe this worked at some point, when requests had an internal _fp object. it doesn't now, so that stuff never gets called, and entries never get cached here.
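
As an aside, the "cache the response when the stream has been consumed" idea from the adapter excerpt above can be illustrated with the standard library alone. This is a simplified stand-in, not cachecontrol's CallbackFileWrapper; the class name CallbackReader is hypothetical, and it says nothing about whether requests actually reads the wrapped stream.

```python
import io


class CallbackReader:
    """Wrap a file-like object and fire `callback` once it is fully read."""

    def __init__(self, fp, callback):
        self._fp = fp
        self._buffer = io.BytesIO()  # accumulates everything read so far
        self._callback = callback

    def read(self, amt=None):
        data = self._fp.read(amt)
        self._buffer.write(data)
        # Exhausted if we asked for everything, or a sized read came back empty.
        exhausted = amt is None or amt < 0 or (amt > 0 and not data)
        if exhausted and self._callback is not None:
            callback, self._callback = self._callback, None  # fire only once
            callback(self._buffer.getvalue())
        return data


cached_bodies = []
reader = CallbackReader(io.BytesIO(b"hello world"), cached_bodies.append)

reader.read(5)
print(len(cached_bodies))  # callback has not fired yet: body only partly read

while reader.read(5):
    pass
print(cached_bodies)
```

The point of the pattern is that nothing is stored until the consumer has read the whole body, so a caller inspecting the cache before that point would find it empty.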

so this is definitely broken for me.

Dec 07 '19 03:12 anarcat

apparently, I read the code wrong. as it turns out, http://httpbin.org/cache gets cached by my browser, but not this module. that's a separate problem from "this module doesn't work at all". :) and in fact, the test suite here uses http://httpbin.org/cache/60 and that actually works fine.

maybe we're not caching aggressively enough, but that's a separate problem from "nothing works" which is what I thought was going on.

so: sorry for the noise, this actually works fine here and pretty well at that too!

thanks for this great module.

Dec 07 '19 03:12 anarcat

Closing as unactionable/no action needed at the moment; please let me know if that's wrong and I'll reopen!

Aug 01 '23 04:08 woodruffw
