Process cookies from header only once
:construction: Work in progress :construction:
A first approach to solving #4717. The idea is to process cookies from the header only once; the actual method of solving it can vary (meta key, request attribute, etc.). I know it's not much, but I'd like your opinions @Gallaecio @kmike @wRAR.
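As a rough illustration of the meta-key option, here is a minimal sketch; the middleware class, the dict standing in for Scrapy's cookiejars, and the `_header_cookies_processed` key are all hypothetical, not the code in this patch:

```python
# Hypothetical sketch: parse the user-supplied Cookie header into the jar
# only the first time the request is processed, so that retry/redirect
# copies of the request rely on the jar instead of a stale header.
from http.cookies import SimpleCookie


class HeaderCookiesOnceMiddleware:
    def __init__(self):
        self.jar = {}  # stand-in for Scrapy's per-spider cookiejars

    def process_request(self, request, spider):
        header = request.headers.get("Cookie")  # bytes or None in Scrapy
        if header and not request.meta.get("_header_cookies_processed"):
            parsed = SimpleCookie(header.decode("utf-8"))
            self.jar.update({name: morsel.value for name, morsel in parsed.items()})
            # Mark the request; meta is preserved by request.replace()/copy(),
            # so redirects and retries skip this branch.
            request.meta["_header_cookies_processed"] = True
```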
The following spider works with this patch, while it breaks with the current master (5a386393); I'm working on some tests to reproduce this behaviour.
```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]
    custom_settings = {"COOKIES_DEBUG": True}

    def parse(self, response):
        csrf_token = response.xpath("//input[@name='csrf_token']/@value").get()
        return scrapy.FormRequest.from_response(
            response=response,
            formxpath="//form",
            formdata={
                "csrf_token": csrf_token,
                "username": "admin",
                "password": "admin",
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        assert logout is not None
```
Fixes #1992
What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?
Codecov Report
Merging #4812 into master will increase coverage by 0.02%. The diff coverage is 100.00%.
```
@@            Coverage Diff             @@
##           master    #4812      +/-   ##
==========================================
+ Coverage   87.52%   87.54%   +0.02%
==========================================
  Files         160      160
  Lines        9761     9762       +1
  Branches     1440     1440
==========================================
+ Hits         8543     8546       +3
+ Misses        955      954       -1
+ Partials      263      262       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| scrapy/downloadermiddlewares/cookies.py | 95.78% <100.00%> (+0.04%) :arrow_up: |
| scrapy/core/downloader/__init__.py | 92.48% <0.00%> (+1.50%) :arrow_up: |
> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?
I don't think it would; once the header cookies are processed, they are stored in the corresponding cookiejar and added to the appropriate requests. It would be an interesting test case, though.
> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?
I think that this and other similar cases (like the previously mentioned 302s) need to be covered by tests first (maybe implemented in a separate pull request).
I think it should be a regular test case with a mock server, with defined endpoints that return (a sketch follows this list):
- a 302 with a Set-Cookie header in the response, to trigger the redirect middleware
- a 503 with the same header, to trigger the retry middleware
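A minimal sketch of such endpoints, using the plain standard library rather than Scrapy's test mock server; the paths and cookie values are made up:

```python
# Sketch only: two endpoints for exercising RedirectMiddleware and
# RetryMiddleware against Set-Cookie, as described above.
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/redirect":
            # 302 + Set-Cookie: the redirected request should carry the cookie.
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc; Path=/")
            self.send_header("Location", "/landing")
            self.end_headers()
        elif self.path == "/overloaded":
            # 503 + Set-Cookie: the retried request should carry the cookie.
            self.send_response(503)
            self.send_header("Set-Cookie", "session=abc; Path=/")
            self.end_headers()
        else:  # /landing and anything else
            self.send_response(200)
            self.end_headers()
            # Echo the Cookie header back so a test can assert on it.
            self.wfile.write((self.headers.get("Cookie") or "").encode())


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```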
I am not able to reproduce this issue on a recent master.
script.py
```python
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess


class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        return FormRequest.from_response(
            response,
            # formname='',
            # formcss='',
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        self.logger.info(logout)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LoginSpider)
    process.start()
```
log_output
```
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: scrapybot)
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w 11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2024-01-19 12:38:02 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-19 12:38:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-01-19 12:38:02 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-19 12:38:02 [scrapy.core.engine] INFO: Spider opened
2024-01-19 12:38:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-19 12:38:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2024-01-19 12:38:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2024-01-19 12:38:03 [login] INFO: Logout
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-19 12:38:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1227,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 3638,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 0.815075,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 1, 19, 10, 38, 3, 562096, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 13788,
'httpcompression/response_count': 2,
'log_count/DEBUG': 4,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2024, 1, 19, 10, 38, 2, 747021, tzinfo=datetime.timezone.utc)}
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Spider closed (finished)
```
From this test and the test proposed in [WIP] https://github.com/scrapy/scrapy/pull/6141, I don't see the issue with 302s mentioned in https://github.com/scrapy/scrapy/issues/4717#issuecomment-679167459. The proposed tests rely on being able to look into the CookieJars stored in a CookiesMiddleware attribute, which is not trivial as we don't have an implementation of https://github.com/scrapy/scrapy/issues/1878.
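For reference, the kind of introspection meant here has to poke at the middleware's private state, roughly like this (a sketch; the helper is hypothetical, and the public API proposed in #1878 would make it unnecessary):

```python
# Sketch only: peek at CookiesMiddleware's internal ``jars`` mapping from a
# test. ``jars`` maps the "cookiejar" meta key (None by default) to a jar.
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware


def cookies_in_jar(middleware: CookiesMiddleware, jar_key=None):
    jar = middleware.jars[jar_key]
    return list(jar)  # Scrapy's CookieJar wrapper is iterable over cookies
```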
@Gallaecio As far as I can see, it was initially caused by the change in #2400, and it got fixed later as a result of https://github.com/scrapy/scrapy/commit/e865c4430e58a4faa0e0766b23830f8423d6167a#diff-dd2ebb59f519dd24355f1249d4f00a3ebc30f7dfaf39828a7bd461a144ca817cR58-R59 (related to https://github.com/advisories/GHSA-mfjm-vh54-3f96).
I believe the code example in the PR description is not what needs to be fixed at the moment.
I believe this was the timeline:
- #1992 was reported.
- #2400 fixed it.
- #4717 reported that #2400 introduced a regression, which is what the code in the description of this PR reproduces.
- This PR was created to work on a proper fix.
- #4823 reverted #2400 to remove the regression (and unfix #1992) until a proper fix could be done as part of this PR.
Due to #4823, the actual issue in the main branch now is not #4717, but #1992 (and fixing it in a way that does not cause #4717 again).