scrapy-splash
Request fingerprint takes all headers and additional cookies into consideration
Apologies if this is stated anywhere in the documentation.
From what I can see, by the time a request object gets run through `splash_request_fingerprint`, it will have `splash.args.headers` and `splash.args.cookies` set on its meta by the splash middlewares. That makes the dupefilter and cache facilities see different fingerprints for requests that should have been seen as identical.
I considered moving middleware priorities around, but for me that sort of change is insanely hard to debug, test, or even think about. For now, as a solution, I rewrote the splash dupefilter and splash cachestorage to use a custom fingerprint generator which:
- Removes `cookies` and `headers` from `meta.splash.args`.
- Removes any key which starts with a `_` on `meta.splash`.
- Removes request body before calling vanilla fingerprint since all the stuff is also in there (I mostly use the `/execute` endpoint).
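To illustrate the cleanup steps above, here is a minimal standalone sketch. It is not the code from the gist and does not call Scrapy or scrapy-splash: the names `clean_splash_options` and `stable_fingerprint` are hypothetical, and a plain SHA-1 over the method, URL, and cleaned options stands in for the vanilla fingerprint (the body is simply left out of the hash).

```python
import copy
import hashlib
import json

# Keys under meta["splash"]["args"] that vary between otherwise identical requests.
VOLATILE_ARGS = ("cookies", "headers")


def clean_splash_options(splash_options):
    """Return a copy of the splash options without the volatile parts.

    Hypothetical helper: drops `cookies` and `headers` from `args` and
    any top-level key that starts with an underscore.
    """
    cleaned = copy.deepcopy(splash_options)  # never mutate request.meta itself
    args = cleaned.get("args")
    if isinstance(args, dict):
        for key in VOLATILE_ARGS:
            args.pop(key, None)
    private_keys = [k for k in cleaned if isinstance(k, str) and k.startswith("_")]
    for key in private_keys:
        del cleaned[key]
    return cleaned


def stable_fingerprint(method, url, splash_options):
    """Hash method, URL and the cleaned splash options.

    Stand-in for "vanilla fingerprint with the body removed"; a real
    project would presumably delegate that part to Scrapy instead.
    """
    payload = {
        "method": method.upper(),
        "url": url,
        "splash": clean_splash_options(splash_options),
    }
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()
```

With this, two requests to the same URL whose splash options differ only in `args.cookies`, `args.headers`, or underscore-prefixed keys get the same fingerprint, while a change to something like `args.lua_source` still produces a different one.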
Basic test project I checked this on: testing.zip
Code for my workaround: https://gist.github.com/hermit-crab/0cbd4967c35c3959a76ba37edb99140a
Related to https://github.com/scrapy/scrapy/issues/900