scrapy-playwright GoTo returns None for certain sites (never the first page)

Open AlvinSartorTrityum opened this issue 1 year ago • 5 comments

Hi! I have a spider that uses Playwright with a proxy. NOTE: the spider works as it should when the proxy is not needed, and the proxy itself works, as the first page is scraped correctly.

This is what happens:

The first page is scraped: I see the ************* RESPONSE ************* log, so parse_item is hit once.
Links are extracted and set_playwright_true is called (the list of links is logged).
Errors are raised: 'NoneType' object has no attribute 'all_headers'.

It seems similar to https://github.com/scrapy-plugins/scrapy-playwright/issues/10 and https://github.com/scrapy-plugins/scrapy-playwright/issues/102, and I saw that a fix has been merged in https://github.com/scrapy-plugins/scrapy-playwright/pull/113.

When will the fix be released to the next version? Will this fix this or will it just prevent the error from being raised? Any idea why using the proxy is causing such an exception?

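import logging
from typing import Any, Dict, Iterator, List

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):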
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []

    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }

    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://192.0.0.1:12345",
                "username": "username",
                "password": "password",
            },
        },
    }

    def __init__(self, **kwargs: Any):
        # ...
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback=self.parse_item,
                process_request=self.set_playwright_true,
                follow=True,
            ),
        )
        # ...
        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        yield Request(self.start_urls[0], meta=self.playwright_meta)

    def set_playwright_true(self, request: Request, response: Response):
        self.log("%s => %s " % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        self.log("************* RESPONSE *************", logging.INFO)
        return {
            # ...
        }

Aug 03 '22 15:08 AlvinSartorTrityum

When will the fix be released to the next version?

I just released #113 as v0.0.20.

Will this fix this or will it just prevent the error from being raised?

A warning will be logged letting you know what happened and you will get a valid response, albeit without headers.

Any idea why using the proxy is causing such an exception?

I'm still not quite sure why this error occurs. From the upstream docs I understand that it could happen if a page is reused to navigate to the same URL with a different fragment, but you don't seem to be reusing the page, so that shouldn't be it.
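
For illustration, the fallback described above (a logged warning, then a response with status 200 and no headers) could look roughly like the sketch below. This is not scrapy-playwright's actual code: `status_and_headers` is a hypothetical helper, and it only assumes that the object returned by Playwright's `page.goto` is either `None` or exposes `status` and an awaitable `all_headers()`.

```python
import asyncio
import logging
from typing import Any, Dict, Optional, Tuple

logger = logging.getLogger("goto-none-sketch")


async def status_and_headers(response: Optional[Any], url: str) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for the result of page.goto, tolerating None."""
    if response is None:
        # page.goto can return None, e.g. for about:blank or for a navigation
        # to the same URL that only changes the fragment.
        logger.warning("Navigating to %s returned None, using status 200 and empty headers", url)
        return 200, {}
    # Normal case: read the status and headers from the Playwright response.
    return response.status, await response.all_headers()
```

Without a guard like this, calling `all_headers()` on the `None` result is what produces the `'NoneType' object has no attribute 'all_headers'` error.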

Aug 03 '22 16:08 elacuesta

Thanks @elacuesta!

I've tried the new version and it does change something. As you said, a warning is logged and no exception is raised, but the page is not crawled and links are not extracted.

Aug 04 '22 06:08 AlvinSartorTrityum

Sadly I cannot reproduce with a minimal example using mitmproxy.

import logging

from scrapy import Request
from scrapy.spiders.crawl import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                # ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }
    rules = (
        Rule(
            callback="parse_item",
            process_request="set_playwright_true",
            follow=True,
        ),
    )

    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }

    def start_requests(self):
        yield Request("http://books.toscrape.com/", meta=self.playwright_meta)

    def set_playwright_true(self, request, response):
        self.log("%s => %s " % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_item(self, response):
        link_count = len(response.css("a"))
        self.log(f"Response: {response}, link_count={link_count}", logging.INFO)
        return {"url": response.url, "link_count": link_count}

2022-08-07 13:37:13 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/catalogue/category/books/music_14/index.html>, link_count=80
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books_1/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/travel_2/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/mystery_3/index.html 
...
2022-08-07 13:37:13 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html>, link_count=86
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html => http://books.toscrape.com/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/religion_12/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/music_14/index.html 
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/default_15/index.html 
...
2022-08-07 13:37:15 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/index.html>, link_count=94
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books_1/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/travel_2/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/mystery_3/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html 
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/classics_6/index.html 
...

It seems like you might be getting empty responses from your proxy. Do you get the same behavior with a different proxy?

Aug 07 '22 16:08 elacuesta

Hi @elacuesta, still investigating this! We ran the spider from another country without a proxy, and the results are the same, so it seems that the site is doing something strange that is breaking the Page.

2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://hc.support-discountcasino1.com/hc/tr 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/vip-programi 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/kampanyalar/hizinda-para-yatirma-yollari 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/kampanyalar/yeni-uyelere-ozel-1000-tl-nakit-iade 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://twitter.com/Discountcasino8 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://www.instagram.com/discount_casino/ 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://t.me/discountcasinocom 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/bilincli-oyun 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/license 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/sartlar 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/gizlilik-sozlesmesi 
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/oyun-kurallari 
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/vip-programi> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/oyun-kurallari> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/gizlilik-sozlesmesi> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/sartlar> returned None, the response will have empty headers and status 200
2022-08-10 22:56:02 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/license> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/bilincli-oyun> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/kampanyalar/hizinda-para-yatirma-yollari> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/kampanyalar/yeni-uyelere-ozel-1000-tl-nakit-iade> returned None, the response will have empty headers and status 200

I guess you should also be able to reproduce it if you use this site.

Aug 11 '22 09:08 AlvinSartorTrityum

I'll try to crawl this with Playwright only to see if the issue comes from there. Sorry for making this such a long investigation.
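
A Playwright-only check along these lines might look like the sketch below. It assumes the `playwright` package and a Chromium build are installed, uses the async API, and keeps the `networkidle` wait from the spider. `describe_goto_result` and `check_urls` are hypothetical names, and the URLs to pass in are whichever pages triggered the warning.

```python
import asyncio
from typing import Any, List, Optional


def describe_goto_result(url: str, response: Optional[Any]) -> str:
    """Summarise what page.goto returned for a URL."""
    if response is None:
        return f"{url} -> None"
    return f"{url} -> {response.status}"


async def check_urls(urls: List[str]) -> List[str]:
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.async_api import async_playwright

    results = []
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        for url in urls:
            page = await browser.new_page()  # fresh page per URL, so no page reuse
            try:
                response = await page.goto(url, wait_until="networkidle")
                results.append(describe_goto_result(url, response))
            finally:
                await page.close()
        await browser.close()
    return results


# Example (needs network access and an installed browser):
# print(asyncio.run(check_urls(["https://example.com/"])))
```

If plain Playwright also reports `None` for the affected URLs, the behaviour comes from the site or from Playwright itself rather than from scrapy-playwright.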

Aug 17 '22 08:08 AlvinSartorTrityum
