scrapy-playwright
Receiving a 400 response after clicking "I agree" on the consent form on Google, but not when running through regular Playwright.
Hi,
I have a strange issue where I am receiving a 400 response from Google after clicking on the "I agree" button on their consent form.
The issue, however, does not appear if I click on the "Customise" button, nor does it happen if I perform the request via regular Playwright. I thought at first that it might be the proxy I am using, but that also works via regular Playwright.
Playwright code:
```python
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={
            'server': 'gb.smartproxy.com:30000',
        })
        page = await browser.new_page()
        await page.goto('https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk')
        await page.click('//span[contains(text(), "I agree")]')
        await page.wait_for_load_state('domcontentloaded')
        await page.screenshot(path='/home/ubuntu/test.png', full_page=True)
        await browser.close()


asyncio.run(main())
```
scrapy-playwright code:
```python
import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google_spider"
    start_urls = ["data:,"]
    custom_settings = {
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'proxy': {
                'server': 'http://gb.smartproxy.com:30000'
            }
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True,
            }
        )

    async def parse_page(self, response):
        page = response.meta['playwright_page']
        if 'consent' in page.url:
            await page.screenshot(path='/home/ubuntu/span_button.png', full_page=True)
            await page.click('//span[contains(text(), "I agree")]')
            await page.wait_for_load_state()
            await page.screenshot(path='/home/ubuntu/after_span.png', full_page=True)
        await page.close()
```
What could be a reason for this? There is probably something simple I am missing here.
OS: Ubuntu 22.04, Python: 3.8.10, scrapy-playwright: 0.0.17
From a quick look, it seems like it might be due to the header processing done by scrapy-playwright. I'd suggest looking into the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting at https://github.com/scrapy-plugins/scrapy-playwright#supported-settings.
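If I'm not mistaken, setting it to None should make scrapy-playwright skip header processing altogether, so requests go out with whatever headers Playwright itself sets rather than the ones from the Scrapy request, e.g.:

```python
# settings.py -- sketch: disable header processing so requests are sent with
# the headers set by Playwright instead of the ones from the Scrapy request
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
```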
I have tried setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS to None in both custom_settings and in the settings.py file, but unfortunately I am still receiving the 400 response code.
I'm not able to reproduce: the site does not reply to me with a response that matches your code, i.e. there is no 'consent' in page.url and no "I agree" button. It could be because I'm not using a proxy; I don't have credentials for the one you posted.
I'm not sure the proxy is the issue. If I don't use a proxy and use a "normal" user agent, then I don't get the consent page. However, if I supply the default Scrapy user agent, then I do get hit with the consent page, and I still get the 400 response code after clicking "I agree". Perhaps this would allow you to reproduce the issue? Also, would I be correct in saying that by setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent shouldn't be the default Scrapy user agent, but rather the user agent set by Playwright? Because when I set it to None and then checked the request headers, the user agent was still the default Scrapy one.
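For reference, this is roughly what I mean by a "normal" user agent; just a regular browser UA string in settings.py (the exact value below is only an example):

```python
# settings.py -- sketch: override Scrapy's default user agent with a
# browser-like string (the exact value here is just an example)
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
)
```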
Indeed, it seems like the site doesn't like Scrapy's user agent. Besides that, I still can't reproduce, either with or without PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None: I get no consent page, just a page saying that my search had no results.
Regarding this:
> would I be correct in saying that by setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent shouldn't be the default scrapy user agent, but rather the user agent set by playwright? Because when I have set it to None and then checked the request headers, the user agent is the default scrapy one.
Thanks! You just found a bug: #98
Thank you for fixing that bug! But this is very strange: I have even started the Scrapy project completely from scratch with a minimal script, and I still get either a 400 or a 405 response code depending on the type of consent page I get. I have attached my logs and script from this minimal setup. As I said, clicking through this consent page works absolutely fine in vanilla Playwright on the same machine, so I'm struggling to wrap my head around why this isn't working.
Spider
```python
import scrapy


class GoogleTestSpider(scrapy.Spider):
    name = 'google_test'
    allowed_domains = ['google.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse_page(self, response):
        print(response.request.headers['User-Agent'])
        page = response.meta['playwright_page']
        # XPath matching any of the known consent buttons/inputs
        xpaths = (
            '//span[contains(text(), "I agree")]'
            '|//span[contains(text(), "Accept all")]'
            '|//input[@value="I agree"]'
            '|//input[@value="Accept all"]'
            '|//span[contains(text(), "Reject all")]'
            '|//input[@value="Reject all"]'
        )
        if 'consent' in response.url:
            print('hit consent')
            print(response.xpath(xpaths))
            if not response.xpath(xpaths):
                # Dump unrecognised consent pages for inspection
                with open('/home/ubuntu/unknown.html', 'w') as w:
                    w.write(await page.content())
            else:
                print('######### FOUND ###########')
                await page.click(xpaths)
                await page.wait_for_load_state()
                print('####### HAVE CLICKED ######')
                await page.screenshot(path='/home/ubuntu/after.png', full_page=True)
        await page.close()
```
Settings file
```python
LOG_FILE = '/home/ubuntu/google_scrape.log'

BOT_NAME = 'playwright_test'

SPIDER_MODULES = ['playwright_test.spiders']
NEWSPIDER_MODULE = 'playwright_test.spiders'

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler'
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_BROWSER_TYPE = 'firefox'
```
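A quick way to double-check which User-Agent actually reaches the site would be to point the same Playwright page at a header-echoing service from the callback, e.g. (rough sketch; httpbin.org is just an example echo service, assuming it is reachable through whatever proxy is in use):

```python
async def parse_page(self, response):
    page = response.meta['playwright_page']
    # Sketch: navigate the same browser page to a header-echoing service
    # to see which User-Agent the target site would actually receive
    await page.goto('https://httpbin.org/headers')
    print(await page.content())
    await page.close()
```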
I've just tried this again and I still can't reproduce. With the code from this comment I get a captcha with a message about suspicious traffic. If I remove all query params from the URL except for the actual search string (the q param), I get a normal page saying there are no results.