
Process cookies from header only once

Open elacuesta opened this pull request 4 years ago • 7 comments

:construction: Work in progress :construction:

A first approach to solving #4717. The idea is to process cookies from the header only once; the exact mechanism can vary (meta key, request attribute, etc.). I know it's not much, but I'd like your opinions @Gallaecio @kmike @wRAR.
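
To illustrate the meta-key variant, here is a minimal self-contained sketch; it is not the actual patch, and the key name "_header_cookies_seen" (plus the dict standing in for the real CookieJar) is invented:

from http.cookies import SimpleCookie

SEEN = "_header_cookies_seen"  # hypothetical meta key, not part of the patch

def process_header_cookies_once(meta, headers, jar):
    # Parse the Cookie header into the jar on the first pass only; redirects
    # and retries of the same request find the flag set and skip re-parsing.
    if meta.get(SEEN):
        return
    meta[SEEN] = True
    for name, morsel in SimpleCookie(headers.get("Cookie", "")).items():
        jar[name] = morsel.value  # a plain dict stands in for the CookieJar

jar, meta = {}, {}
headers = {"Cookie": "sessionid=abc; csrftoken=xyz"}
process_header_cookies_once(meta, headers, jar)  # initial request: parsed
process_header_cookies_once(meta, headers, jar)  # redirect/retry: skipped
print(jar)  # {'sessionid': 'abc', 'csrftoken': 'xyz'}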

The following spider works with this patch, while it breaks with the current master (5a386393) - I'm working on some tests to reproduce this behaviour.

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]
    custom_settings = {"COOKIES_DEBUG": True}  # log Set-Cookie/Cookie headers

    def parse(self, response):
        # Submit the login form; the server answers with a 302 that sets the
        # session cookie, so the redirected request must carry that cookie.
        csrf_token = response.xpath("//input[@name='csrf_token']/@value").get()
        return scrapy.FormRequest.from_response(
            response=response,
            formxpath="//form",
            formdata={
                "csrf_token": csrf_token,
                "username": "admin",
                "password": "admin",
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # Only a logged-in session sees the Logout link.
        logout = response.xpath("//a[text()='Logout']/text()").get()
        assert logout is not None

Fixes #1992

elacuesta avatar Sep 29 '20 17:09 elacuesta

What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

Gallaecio avatar Sep 30 '20 10:09 Gallaecio

Codecov Report

Merging #4812 into master will increase coverage by 0.02%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4812      +/-   ##
==========================================
+ Coverage   87.52%   87.54%   +0.02%     
==========================================
  Files         160      160              
  Lines        9761     9762       +1     
  Branches     1440     1440              
==========================================
+ Hits         8543     8546       +3     
+ Misses        955      954       -1     
+ Partials      263      262       -1     
| Impacted Files | Coverage Δ |
| --- | --- |
| scrapy/downloadermiddlewares/cookies.py | 95.78% <100.00%> (+0.04%) :arrow_up: |
| scrapy/core/downloader/__init__.py | 92.48% <0.00%> (+1.50%) :arrow_up: |

codecov[bot] avatar Sep 30 '20 12:09 codecov[bot]

> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I don't think it would: once the header cookies are processed, they are stored in the corresponding cookiejar and added to the appropriate requests. It would be an interesting test case though.
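
A standard-library illustration of that point (FakeResponse is a minimal stub, not Scrapy code): once the jar holds a cookie, it re-adds it to any later request for the same site, a retry included:

import urllib.request
from http.client import HTTPMessage
from http.cookiejar import CookieJar

class FakeResponse:
    # Just enough of the response interface for CookieJar.extract_cookies().
    def __init__(self, set_cookie):
        self._msg = HTTPMessage()
        self._msg["Set-Cookie"] = set_cookie
    def info(self):
        return self._msg

jar = CookieJar()
first = urllib.request.Request("http://quotes.toscrape.com/login")
jar.extract_cookies(FakeResponse("session=abc"), first)  # response set it

retry = urllib.request.Request("http://quotes.toscrape.com/login")
jar.add_cookie_header(retry)  # the retried request gets the cookie back
print(retry.get_header("Cookie"))  # session=abc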

elacuesta avatar Sep 30 '20 12:09 elacuesta

> What if a retry (instead of a redirect) needs to happen, because the server sent e.g. a 503 response? Wouldn't the retry miss some required cookies?

I think that this and other similar cases (like the previously mentioned 302s) need to be covered by tests first (maybe implemented in a separate pull request).

I think it should be a regular test case with a mock server with defined endpoints (sketched below) that return:

  1. a 302 with a Set-Cookie header in the response, which triggers the redirect middleware
  2. a 503 with the same header, which triggers the retry middleware
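
A standard-library sketch of those two endpoints (the paths, port, and cookie names are arbitrary choices; a real test would use Scrapy's own mockserver helpers):

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class CookieEndpoints(BaseHTTPRequestHandler):
    retried = False  # let /retry succeed on the second attempt

    def do_GET(self):
        if self.path == "/redirect":
            # endpoint 1: 302 + Set-Cookie, exercised by the redirect middleware
            self.send_response(302)
            self.send_header("Set-Cookie", "from-redirect=1")
            self.send_header("Location", "/echo")
            self.end_headers()
        elif self.path == "/retry" and not CookieEndpoints.retried:
            # endpoint 2: 503 + Set-Cookie, exercised by the retry middleware
            CookieEndpoints.retried = True
            self.send_response(503)
            self.send_header("Set-Cookie", "from-retry=1")
            self.end_headers()
        else:
            # echo the Cookie header back so the test can assert on it
            body = (self.headers.get("Cookie") or "").encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

server = HTTPServer(("127.0.0.1", 0), CookieEndpoints)
threading.Thread(target=server.serve_forever, daemon=True).start()
print("listening on port", server.server_address[1])

A test would then crawl /redirect and /retry with CookiesMiddleware enabled and assert that the echoed bodies contain from-redirect=1 and from-retry=1 respectively.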

GeorgeA92 avatar Sep 02 '23 08:09 GeorgeA92

I am not able to reproduce this issue on recent master

script.py
from scrapy import FormRequest
import scrapy
from scrapy.crawler import CrawlerProcess

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        csrf_token = response.xpath('//input[@name="csrf_token"]/@value').get()
        return FormRequest.from_response(
            response,
            formxpath='//form',
            formdata={
                'csrf_token': csrf_token,
                'username': 'admin',
                'password': 'admin'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        logout = response.xpath("//a[text()='Logout']/text()").get()
        self.logger.info(logout)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LoginSpider)
    process.start()
log_output
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Scrapy 2.10.1 started (bot: scrapybot)
2024-01-19 12:38:02 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22621-SP0
2024-01-19 12:38:02 [scrapy.addons] INFO: Enabled addons:
[]
2024-01-19 12:38:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor

2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-01-19 12:38:02 [scrapy.crawler] INFO: Overridden settings:
{}
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-01-19 12:38:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-01-19 12:38:02 [scrapy.core.engine] INFO: Spider opened
2024-01-19 12:38:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-01-19 12:38:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/login> (referer: None)
2024-01-19 12:38:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://quotes.toscrape.com/> from <POST http://quotes.toscrape.com/login>
2024-01-19 12:38:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: http://quotes.toscrape.com/login)
2024-01-19 12:38:03 [login] INFO: Logout
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-19 12:38:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1227,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 3638,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 0.815075,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 1, 19, 10, 38, 3, 562096, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 13788,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2024, 1, 19, 10, 38, 2, 747021, tzinfo=datetime.timezone.utc)}
2024-01-19 12:38:03 [scrapy.core.engine] INFO: Spider closed (finished)

From this test, and from the test proposed in [WIP] https://github.com/scrapy/scrapy/pull/6141, I don't see the issue with 302s mentioned in https://github.com/scrapy/scrapy/issues/4717#issuecomment-679167459. The proposed tests rely on being able to look into the CookieJars stored in a CookiesMiddleware attribute, which is not trivial as we don't have an implementation of https://github.com/scrapy/scrapy/issues/1878.
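
For reference, a hedged sketch of how a unit test can reach those jars today, by instantiating the middleware directly (this mirrors what Scrapy's own middleware tests do; the jars attribute is an undocumented internal and may change):

from scrapy import Spider
from scrapy.downloadermiddlewares.cookies import CookiesMiddleware
from scrapy.http import Request, Response

spider = Spider(name="dummy")
mw = CookiesMiddleware()

# Simulate a response that sets a session cookie.
req = Request("https://toscrape.com/login")
mw.process_request(req, spider)
resp = Response(req.url, headers={"Set-Cookie": "session=abc"}, request=req)
mw.process_response(req, resp, spider)

# The stored cookie is added to a follow-up request for the same site...
followup = Request("https://toscrape.com/profile")
mw.process_request(followup, spider)
print(followup.headers.get("Cookie"))  # b'session=abc'

# ...and the jar itself can be inspected through the internal attribute,
# keyed by the "cookiejar" meta value (None by default).
print(list(mw.jars[None]))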

GeorgeA92 avatar Jan 19 '24 11:01 GeorgeA92

@Gallaecio As far as I can see, this was initially caused by the change in /pull/2400, and it was later fixed as a result of https://github.com/scrapy/scrapy/commit/e865c4430e58a4faa0e0766b23830f8423d6167a#diff-dd2ebb59f519dd24355f1249d4f00a3ebc30f7dfaf39828a7bd461a144ca817cR58-R59 (related to https://github.com/advisories/GHSA-mfjm-vh54-3f96)

GeorgeA92 avatar Jan 19 '24 14:01 GeorgeA92

I believe the code example in the PR description is not what needs to be fixed at the moment.

I believe this was the timeline:

  1. #1992 was reported.
  2. #2400 fixed it.
  3. #4717 reported that #2400 introduced a regression, which the code in the description of this PR reproduces.
  4. This PR was created to work on a proper fix.
  5. #4823 reverted #2400 to remove the regression (and unfix #1992) until a proper fix could be done as part of this PR.

Due to #4823, the actual issue in the main branch now is not #4717, but #1992 (and fixing it in a way that does not cause #4717 again).

Gallaecio avatar Jan 19 '24 15:01 Gallaecio