scrapy-splash

Request fingerprint takes all headers and additional cookies into consideration

Open hermit-crab opened this issue 6 years ago • 1 comment

Apologies if this is stated anywhere in the documentation.

From what I can see, by the time a request object gets run through splash_request_fingerprint, the splash middlewares have already set splash.args.headers and splash.args.cookies on its meta. That makes the dupefilter and cache facilities see different fingerprints for requests that kinda sorta should have been treated as identical.

I was considering moving middleware priorities around, but for me this sorta change is insanely hard to debug / test or even to think about. For now, as a solution, I rewrote the splash dupefilter and splash cachestorage to use a custom fingerprint generator which:

  • Removes cookies and headers from meta.splash.args.
  • Removes any key which starts with a _ on meta.splash.
  • Removes the request body before calling the vanilla fingerprint, since all of that (headers and cookies included) is also in there (I mostly use the /execute endpoint).
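To make the three steps above concrete, here is a minimal, library-free sketch of the idea. It is not the code from the gist and it does not call Scrapy or scrapy-splash: `clean_splash_meta` and `sketch_fingerprint` are hypothetical names, the hash over method and URL is only a stand-in for Scrapy's vanilla fingerprint, and the `_internal` key in the usage part is a made-up example of an underscore-prefixed entry.

```python
import copy
import hashlib
import json


def clean_splash_meta(splash_meta):
    """Return a cleaned copy of meta['splash']; the input is not modified."""
    # Step 2: drop every top-level key that starts with an underscore.
    cleaned = {
        key: copy.deepcopy(value)
        for key, value in splash_meta.items()
        if not key.startswith("_")
    }
    # Step 1: drop headers and cookies from the splash args.
    args = cleaned.get("args")
    if isinstance(args, dict):
        args.pop("headers", None)
        args.pop("cookies", None)
    return cleaned


def sketch_fingerprint(method, url, meta):
    """Hypothetical stand-in for a fingerprint function.

    Step 3: the request body is deliberately left out, because for Splash
    requests it repeats the same arguments (headers and cookies included).
    """
    payload = {
        "method": method.upper(),
        "url": url,
        "splash": clean_splash_meta(meta.get("splash", {})),
    }
    raw = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


# Two requests that differ only in headers, cookies and an internal key.
meta_a = {"splash": {"endpoint": "execute",
                     "args": {"url": "http://example.com",
                              "headers": {"User-Agent": "a"},
                              "cookies": []}}}
meta_b = {"splash": {"endpoint": "execute",
                     "args": {"url": "http://example.com",
                              "headers": {"User-Agent": "b"},
                              "cookies": [{"name": "sid", "value": "1"}]},
                     "_internal": True}}

fp_a = sketch_fingerprint("POST", "http://localhost:8050/execute", meta_a)
fp_b = sketch_fingerprint("POST", "http://localhost:8050/execute", meta_b)
same = fp_a == fp_b  # the two requests now collapse to one fingerprint
```

In a real project the cleaned meta would be fed to the actual fingerprint function used by the dupefilter and cache storage rather than to this stand-in hash.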

Basic test project I checked this on: testing.zip

Code for my workaround: https://gist.github.com/hermit-crab/0cbd4967c35c3959a76ba37edb99140a

hermit-crab, May 23 '18 07:05

Related to https://github.com/scrapy/scrapy/issues/900

Gallaecio, May 09 '19 11:05