scrapy icon indicating copy to clipboard operation
scrapy copied to clipboard

Response to/from dict, HTTP cache updates

Open elacuesta opened this issue 3 years ago • 7 comments

Continuation of #5130. Closes #1450, closes #1877, closes #4923.

This PR introduces a new feature to (de)serialize responses, which is used in the HTTP cache extension.

elacuesta avatar May 04 '21 17:05 elacuesta

Codecov Report

Merging #5131 (b4de33d) into master (9a28eb0) will decrease coverage by 0.01%. The diff coverage is 98.82%.

@@            Coverage Diff             @@
##           master    #5131      +/-   ##
==========================================
- Coverage   88.77%   88.76%   -0.02%     
==========================================
  Files         163      163              
  Lines       10666    10653      -13     
  Branches     1818     1812       -6     
==========================================
- Hits         9469     9456      -13     
  Misses        922      922              
  Partials      275      275              
Impacted Files Coverage Δ
scrapy/utils/response.py 92.06% <95.00%> (+1.86%) :arrow_up:
scrapy/extensions/httpcache.py 94.66% <100.00%> (-0.78%) :arrow_down:
scrapy/http/response/__init__.py 97.70% <100.00%> (+0.23%) :arrow_up:
scrapy/utils/httpobj.py 100.00% <100.00%> (ø)
scrapy/utils/request.py 100.00% <100.00%> (ø)

codecov[bot] avatar May 04 '21 17:05 codecov[bot]

Hey! Unlike request <=> dict conversion, which is used by Scrapy e.g. for the disk queues, here it seems the newly introduced method and function are not used by Scrapy itself. E.g. it is not used for storing of the cache data. I think that's not a great situation, because we're almost introducing an extension point, protocol for response serialization / deserialization here, but not using it in the framework itself.

What's the use case you see for these feature as it is now? Is it just debugging, as in the original ticket?

kmike avatar May 10 '21 19:05 kmike

@kmike I started working on this (on #4945, rather) after #4923. This would provide a way of dealing with cached responses in an uniform fashion, taking the burden off from the cache implementations themselves.

elacuesta avatar May 11 '21 12:05 elacuesta

I do wonder if maybe it makes sense to implement this along with the changes to cache handling that make use of this.

Gallaecio avatar May 11 '21 12:05 Gallaecio

This is technically backward-incompatible, but I don't think that's too much of a problem in practice (I believe invalidating an existing cache after a Scrapy version change is probably reasonable).

elacuesta avatar Jun 11 '21 17:06 elacuesta

I’m unsure (+0) on whether or not it’s fine.

I do wonder if merging this change would affect https://github.com/scrapy/scrapy/pull/5175 in any way (as in https://github.com/scrapy/scrapy/pull/5175 could be implemented differently if we are going to break caches anyway). I’m not sure either, though.

Gallaecio avatar Jun 13 '21 17:06 Gallaecio

@elacuesta What would you think of extracting a subset of these changes into a separate pull request, specifically Response.attributes and its use in replace, to address https://github.com/scrapy/scrapy/issues/1877? I suspect we’ll be able to merge that sooner.

Gallaecio avatar Jul 14 '21 08:07 Gallaecio