
Receiving a 400 response after clicking "I agree" on the consent form on Google, but not when running through regular Playwright.

Open · LTWood opened this issue 2 years ago · 7 comments

Hi,

I have a strange issue where I am receiving a 400 response from Google after clicking on the "I agree" button on their consent form.

[Screenshot attachment: after_span.png]

This issue, however, does not appear if I click on the "Customise" button, nor does it happen if I perform the request via regular Playwright. At first I thought it might be the proxy I am using, but that also works via regular Playwright.

Playwright code:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy={
            'server': 'gb.smartproxy.com:30000',
        })
        page = await browser.new_page()
        await page.goto('https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk')
        await page.click('//span[contains(text(), "I agree")]')
        await page.wait_for_load_state('domcontentloaded')
        await page.screenshot(path='/home/ubuntu/test.png', full_page=True)
        await browser.close()

asyncio.run(main())

scrapy-playwright code:

import scrapy


class GoogleSpider(scrapy.Spider):
    name = "google_spider"
    start_urls = ["data:,"]

    custom_settings = {
        'PLAYWRIGHT_LAUNCH_OPTIONS': {
            'proxy': {
                'server': 'http://gb.smartproxy.com:30000'
            }
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True,
            }
        )

    async def parse_page(self, response):
        page = response.meta['playwright_page']
        if 'consent' in page.url:
            await page.screenshot(path='/home/ubuntu/span_button.png', full_page=True)
            await page.click('//span[contains(text(), "I agree")]')
            await page.wait_for_load_state()
            await page.screenshot(path='/home/ubuntu/after_span.png', full_page=True)
        await page.close()

What could be a reason for this? There is probably something simple I am missing here.

OS: Ubuntu 22.04 Python: 3.8.10 scrapy-playwright: 0.0.17

LTWood avatar Jun 14 '22 17:06 LTWood

From a quick look, it seems like it might be due to the header processing done by scrapy-playwright. I'd suggest you look into the PLAYWRIGHT_PROCESS_REQUEST_HEADERS setting at https://github.com/scrapy-plugins/scrapy-playwright#supported-settings.
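For illustration, a minimal sketch of disabling that header processing so the headers generated by Playwright are sent unmodified (per the README, a value of None keeps the browser's own headers; the same key can equally go in a spider's custom_settings):

# settings.py
# Keep the headers generated by the Playwright browser instead of Scrapy's
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None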

elacuesta avatar Jun 15 '22 20:06 elacuesta

I have tried setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS to None in both custom_settings and in the settings.py file, but unfortunately I am still receiving the 400 response code.

LTWood avatar Jun 15 '22 22:06 LTWood

I'm not able to reproduce: the site does not reply with a response that matches your code, i.e. there is no 'consent' in page.url and no "I agree" button. It could be because I'm not using a proxy; I don't have credentials for the one you posted.

elacuesta avatar Jun 18 '22 03:06 elacuesta

I'm not sure the proxy is the issue. If I don't use a proxy and use a "normal" user agent, I don't get the consent page. However, if I supply the default Scrapy user agent, then I do get hit with the consent page, and I still get the 400 response code after clicking "I agree". Perhaps this would allow you to reproduce the issue? Also, would I be correct in saying that with PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent shouldn't be the default Scrapy user agent, but rather the user agent set by Playwright? When I set it to None and then check the request headers, the user agent is still the default Scrapy one.
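(One way to see the headers the browser actually transmits, rather than the ones recorded on Scrapy's request object, is to point a throwaway spider at a header-echo endpoint; a minimal sketch, with httpbin.org used purely as an example:)

import scrapy


class HeaderCheckSpider(scrapy.Spider):
    # Throwaway spider, only for inspecting what the browser really sends
    name = 'header_check'

    def start_requests(self):
        yield scrapy.Request(
            url='https://httpbin.org/headers',
            callback=self.parse_headers,
            meta={'playwright': True},
        )

    def parse_headers(self, response):
        # httpbin echoes back the received headers, including the User-Agent
        self.logger.info(response.text)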

LTWood avatar Jun 18 '22 14:06 LTWood

Indeed, it seems like the site doesn't like Scrapy's user agent. Besides that, I still can't reproduce: both with and without PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None I get no consent page, just a page saying that my search had no results.

Regarding this:

would I be correct in saying that by setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None, the User-Agent shouldn't be the default scrapy user agent, but rather the user agent set by playwright? Because when I have set it to None and then checked the request headers, the user agent is the default scrapy one.

Thanks! You just found a bug: #98
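In the meantime, a possible workaround is to set a browser-like User-Agent explicitly, since with the default header processing it is Scrapy's User-Agent that ends up on the browser request. A minimal sketch (the UA string below is just an example):

# settings.py
# Example browser-like User-Agent; any recent browser string should work
USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
)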

elacuesta avatar Jun 18 '22 18:06 elacuesta

Thank you for fixing that bug! But this is very strange. I have started the scrapy project completely from scratch with a minimal script and I still get either a 400 or a 405 response code, depending on the type of consent page that I get. I have attached my log and script from this minimal setup. As I said, clicking through this consent page works absolutely fine in vanilla Playwright on the same machine, so I'm struggling to wrap my head around why this isn't working.

Spider

import scrapy


class GoogleTestSpider(scrapy.Spider):
    name = 'google_test'
    allowed_domains = ['google.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.google.com/search?q=05055775403308&tbm=shop&uule=w+CAIQICIdTG9uZG9uLEVuZ2xhbmQsVW5pdGVkIEtpbmdkb20=&hl=en&gl=uk',
            callback=self.parse_page,
            meta={
                'playwright': True,
                'playwright_include_page': True
            }
        )

    async def parse_page(self, response):
        print(response.request.headers['User-Agent'])
        page = response.meta['playwright_page']
        xpaths = '//span[contains(text(), "I agree")]|//span[contains(text(), "Accept all")]|//input[@value="I agree"]|//input[@value="Accept all"]|//span[contains(text(), "Reject all")]|//input[@value="Reject all"]'
        if 'consent' in response.url:
            print('hit consent')
            print(response.xpath(xpaths))
            if not response.xpath(xpaths):
                with open('/home/ubuntu/unknown.html', 'w') as w:
                    w.write(await page.content())
            else:
                print('######### FOUND ###########')
                await page.click(xpaths)
                await page.wait_for_load_state()
                print('####### HAVE CLICKED ######')
                await page.screenshot(path='/home/ubuntu/after.png', full_page=True)
        await page.close()

Settings file

LOG_FILE = '/home/ubuntu/google_scrape.log'

BOT_NAME = 'playwright_test'

SPIDER_MODULES = ['playwright_test.spiders']
NEWSPIDER_MODULE = 'playwright_test.spiders'

DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler'
}

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
PLAYWRIGHT_BROWSER_TYPE = 'firefox'

[Attachment: google_scrape.log]

LTWood avatar Jun 19 '22 14:06 LTWood

I've just tried this again and I still can't reproduce. With the code from this comment I get a captcha with a message about suspicious traffic. If I remove all query parameters except the actual search string (the q param), I get a normal page saying there are no results.

elacuesta avatar Nov 29 '23 00:11 elacuesta