scrapy-playwright
GoTo returns None for certain sites (never the first page)
Hi! I have a spider that uses Playwright with a proxy. NOTE: the spider works as it should on sites where the proxy is not needed, and the proxy itself works, since the first page is scraped correctly.
This is what happens:
- the first page is scraped: I see the "************* RESPONSE *************" log, so parse_item is hit once
- links are extracted and set_playwright_true is called (the list of links is logged)
- errors are raised:
'NoneType' object has no attribute 'all_headers'
It seems similar to https://github.com/scrapy-plugins/scrapy-playwright/issues/10 and https://github.com/scrapy-plugins/scrapy-playwright/issues/102 and I saw that a fix has been merged with https://github.com/scrapy-plugins/scrapy-playwright/pull/113 .
When will the fix be released in a new version? Will it fix this, or will it just prevent the error from being raised? Any idea why using the proxy is causing such an exception?
import logging
from typing import Any, Dict, Iterator, List

from scrapy import Request
from scrapy.http import Response
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):
    name = "client-side-site"
    handle_httpstatus_list = [301, 302, 401, 403, 404, 408, 429, 500, 503]
    exclude_patterns: List[str] = []
    # Meta keys that route a request through the Playwright download handler.
    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://192.0.0.1:12345",
                "username": "username",
                "password": "password",
            },
        },
    }

    def __init__(self, **kwargs: Any):
        # ...
        self.rules = (
            Rule(
                LinkExtractor(allow=allow_path),
                callback=self.parse_item,
                process_request=self.set_playwright_true,
                follow=True,
            ),
        )
        # ...
        super().__init__(**kwargs)

    def start_requests(self) -> Iterator[Request]:
        yield Request(self.start_urls[0], meta=self.playwright_meta)

    def set_playwright_true(self, request: Request, response: Response):
        # Log "source page => extracted link" and enable Playwright for the new request.
        self.log("%s => %s " % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_start_url(self, response: Response) -> Dict[str, Any]:
        return self.parse_item(response)

    def parse_item(self, response: Response) -> Dict[str, Any]:
        self.log("************* RESPONSE *************", logging.INFO)
        return {
            # ...
        }
When will the fix be released in a new version?
I just released #113 as v0.0.20.
Will it fix this, or will it just prevent the error from being raised?
A warning will be logged letting you know what happened and you will get a valid response, albeit without headers.
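As a rough illustration of that behavior (this is not the actual code merged in #113), the fallback could look something like the sketch below. The names `status_and_headers` and `FakeResponse` are hypothetical; the only Playwright attributes assumed are `Response.status` and the async `Response.all_headers()`.

```python
import asyncio
import logging
from typing import Any, Dict, Optional, Tuple

logger = logging.getLogger("goto-none-sketch")


async def status_and_headers(
    request_repr: str, response: Optional[Any]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for the result of page.goto(), tolerating None."""
    if response is None:
        # Hypothetical fallback: warn, then pretend the navigation was a
        # plain 200 with no headers instead of raising AttributeError.
        logger.warning(
            "Navigating to %s returned None, the response will have "
            "empty headers and status 200",
            request_repr,
        )
        return 200, {}
    return response.status, await response.all_headers()


class FakeResponse:
    """Stand-in for Playwright's Response, used only for this demonstration."""

    status = 404

    async def all_headers(self) -> Dict[str, str]:
        return {"content-type": "text/html"}
```

Calling `asyncio.run(status_and_headers("<GET https://example.org>", None))` takes the fallback branch, while passing a `FakeResponse()` returns its real status and headers.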
Any idea why using the proxy is causing such an exception?
I'm still not quite sure why this error occurs. From the upstream docs, I understand that it could happen if a page is reused to navigate to the same URL but with a different fragment. You don't seem to be reusing the page, so that shouldn't be it.
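To make the fragment case concrete, here is a small standard-library sketch of how one might check whether two URLs point to the same document and differ only in their fragment. The helper name `differs_only_by_fragment` is made up for this example and is not part of Playwright or scrapy-playwright.

```python
from urllib.parse import urldefrag


def differs_only_by_fragment(current_url: str, target_url: str) -> bool:
    """True if both URLs are the same document but carry different #fragments."""
    current, current_fragment = urldefrag(current_url)
    target, target_fragment = urldefrag(target_url)
    return current == target and current_fragment != target_fragment
```

For example, `differs_only_by_fragment("https://example.org/a#x", "https://example.org/a#y")` is true, whereas two URLs with different paths, or two identical URLs, give false. A reused page navigating between a pair for which this is true is the situation the upstream docs describe.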
Thanks @elacuesta!
I've tried the new version and it does change something. As you said, a warning is logged and no exception is raised, but the page is not crawled and links are not extracted.
Sadly, I cannot reproduce this with a minimal example using mitmproxy:
import logging

from scrapy import Request
from scrapy.spiders.crawl import CrawlSpider, Rule


class PlaywrightSpiderWithProxy(CrawlSpider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                #   ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }
    rules = (
        Rule(
            callback="parse_item",
            process_request="set_playwright_true",
            follow=True,
        ),
    )
    playwright_meta = {
        "playwright": True,
        "playwright_page_goto_kwargs": {"wait_until": "networkidle"},
    }

    def start_requests(self):
        yield Request("http://books.toscrape.com/", meta=self.playwright_meta)

    def set_playwright_true(self, request, response):
        self.log("%s => %s " % (response.url, request.url), logging.INFO)
        request.meta.update(self.playwright_meta)
        return request

    def parse_item(self, response):
        link_count = len(response.css("a"))
        self.log(f"Response: {response}, link_count={link_count}", logging.INFO)
        return {"url": response.url, "link_count": link_count}
2022-08-07 13:37:13 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/catalogue/category/books/music_14/index.html>, link_count=80
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books_1/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/travel_2/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
...
2022-08-07 13:37:13 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html>, link_count=86
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html => http://books.toscrape.com/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/religion_12/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/music_14/index.html
2022-08-07 13:37:13 [proxy-spider] INFO: http://books.toscrape.com/catalogue/category/books/music_14/index.html => http://books.toscrape.com/catalogue/category/books/default_15/index.html
...
2022-08-07 13:37:15 [proxy-spider] INFO: Response: <200 http://books.toscrape.com/index.html>, link_count=94
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books_1/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/travel_2/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
2022-08-07 13:37:15 [proxy-spider] INFO: http://books.toscrape.com/index.html => http://books.toscrape.com/catalogue/category/books/classics_6/index.html
...
It seems like you might be getting empty responses from your proxy. Do you get the same behavior with a different proxy?
Hi @elacuesta, still investigating this! We ran the spider from another country, without a proxy, and the results are the same, so it seems that the site is doing something strange that breaks the Page.
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://hc.support-discountcasino1.com/hc/tr
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/vip-programi
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/kampanyalar/hizinda-para-yatirma-yollari
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/kampanyalar/yeni-uyelere-ozel-1000-tl-nakit-iade
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://twitter.com/Discountcasino8
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://www.instagram.com/discount_casino/
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://t.me/discountcasinocom
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/bilincli-oyun
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/license
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/sartlar
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/gizlilik-sozlesmesi
2022-08-10 22:55:58 [client-side-site] INFO: https://discountcasino307.com/ => https://discountcasino307.com/tr/oyun-kurallari
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/vip-programi> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/oyun-kurallari> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/gizlilik-sozlesmesi> returned None, the response will have empty headers and status 200
2022-08-10 22:56:01 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/sartlar> returned None, the response will have empty headers and status 200
2022-08-10 22:56:02 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/license> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/bilincli-oyun> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/kampanyalar/hizinda-para-yatirma-yollari> returned None, the response will have empty headers and status 200
2022-08-10 22:56:03 [scrapy-playwright] WARNING: Navigating to <GET https://discountcasino307.com/tr/kampanyalar/yeni-uyelere-ozel-1000-tl-nakit-iade> returned None, the response will have empty headers and status 200
I guess you should also be able to reproduce it if you use this site.
I'll try to crawl this with Playwright only to see if the issue comes from there. Sorry for making this such a long investigation.
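A Playwright-only check along those lines might look like the sketch below. It is only a starting point under a few assumptions: the helper names `check_goto` and `main` are made up, Chromium is used as the browser, a fresh page is opened per URL, and proxy settings are left out.

```python
import asyncio
from typing import Any, Dict, Iterable, Optional


async def check_goto(context: Any, urls: Iterable[str]) -> Dict[str, Optional[int]]:
    """Map each URL to the status of page.goto(), or None if goto returned None."""
    results: Dict[str, Optional[int]] = {}
    for url in urls:
        # A new page per URL, so no page is reused between navigations.
        page = await context.new_page()
        try:
            response = await page.goto(url, wait_until="networkidle")
            results[url] = None if response is None else response.status
        finally:
            await page.close()
    return results


async def main(urls: Iterable[str]) -> None:
    # Imported lazily so check_goto() can be exercised without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()
        try:
            for url, status in (await check_goto(context, urls)).items():
                print(url, "->", status)
        finally:
            await browser.close()
```

Running `asyncio.run(main([...]))` with the start URL followed by a few of the URLs from the warnings above would show whether `goto` also returns None outside of Scrapy.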