scrapy-playwright

Support playwright_stealth

Open hqtang33 opened this issue 1 year ago • 1 comments

Integrated playwright_stealth and added PLAYWRIGHT_STEALTH_ENABLED as an optional setting.

Bot-detection test results are attached below.

PLAYWRIGHT_STEALTH_ENABLED = True: [image: ENABLED]

PLAYWRIGHT_STEALTH_ENABLED = False: [image: DISABLED]

hqtang33, Jul 26 '22 14:07

Thank you very much for the contribution, but I don't want to include any third-party dependency unless it's really necessary. I've been thinking that one way to allow this functionality (and address #25 at the same time) would be to add a way to handle pages right after they are created (an idea I've already explored at https://github.com/scrapy-plugins/scrapy-playwright/issues/26#issuecomment-930182537). I'm imagining something like the following:

from scrapy import Spider, Request
from playwright.async_api import Page

async def new_page_handler(page: Page) -> None:
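    # Pass the file location via `path`; the first positional argument is the script source itself
    await page.add_init_script(path="/path/to/script")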
    # more stuff

class AwesomeSpider(Spider):
    def start_requests(self):
        yield Request(
            url="https://httpbin.org/get",
            meta={"playwright": True, "playwright_configure_page": new_page_handler},
        )

elacuesta, Jul 27 '22 19:07

For the record, this should be possible after #128

elacuesta, Oct 09 '22 21:10

It should be possible to include this with an optional pip dependency e.g. scrapy-playwright[with_playwright_stealth] to avoid requiring the dependency while also including this in the distribution

nimish, Nov 01 '22 15:11
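The optional-dependency idea above can be sketched as follows. This is only an illustration, not the project's packaging: the `EXTRAS_REQUIRE` table and the `load_optional` helper are hypothetical names, and the sketch assumes the extra would pull in the `playwright-stealth` package from PyPI.

```python
import importlib

# Hypothetical extras table, as it might be passed to
# setuptools.setup(extras_require=EXTRAS_REQUIRE) in a setup.py.
EXTRAS_REQUIRE = {"with_playwright_stealth": ["playwright-stealth"]}


def load_optional(module_name: str, attribute: str):
    """Return module.attribute, or None when the optional package is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attribute, None)


# None unless the user installed the extra (assumption: the function
# is exposed as playwright_stealth.stealth_async).
stealth_async = load_optional("playwright_stealth", "stealth_async")
```

With a guard like this, the main package imports cleanly whether or not the extra is installed, and stealth is applied only when `stealth_async` is not None.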

That's true, but it would still require changes to the main handler in order to support the integration - that's what I want to avoid. It's possible to integrate with this after v0.0.22, by using the playwright_page_init_callback request meta key:

import scrapy
from playwright_stealth import stealth_async

async def init_page(page, request):
    await stealth_async(page)

class StealthSpider(scrapy.Spider):
    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

elacuesta, Nov 01 '22 20:11
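For readers who still want the on/off switch from the original proposal, the callback approach can be combined with a project-level flag. The following is a minimal sketch under stated assumptions: `PLAYWRIGHT_STEALTH_ENABLED` is treated as a custom setting of your own project (scrapy-playwright does not read it), `make_init_callback` and `playwright_meta` are hypothetical helpers, and a plain dict and a fake stealth function stand in for Scrapy settings and `stealth_async` so the example runs without a browser.

```python
import asyncio


def make_init_callback(stealth):
    """Wrap a stealth coroutine function in the (page, request) callback signature."""
    async def init_page(page, request):
        await stealth(page)
    return init_page


def playwright_meta(settings: dict, stealth=None) -> dict:
    """Build request meta, adding the init callback only when the flag is set."""
    meta = {"playwright": True}
    if settings.get("PLAYWRIGHT_STEALTH_ENABLED", False):
        if stealth is None:
            raise RuntimeError("stealth is enabled but no stealth function was provided")
        meta["playwright_page_init_callback"] = make_init_callback(stealth)
    return meta


# Stand-in for playwright_stealth.stealth_async, so the sketch runs offline.
async def fake_stealth(page):
    page.append("stealth applied")


demo_page = []  # stand-in for a Playwright Page object
demo_meta = playwright_meta({"PLAYWRIGHT_STEALTH_ENABLED": True}, stealth=fake_stealth)
asyncio.run(demo_meta["playwright_page_init_callback"](demo_page, request=None))
print(demo_page)  # ['stealth applied']
```

In a real spider you would pass `stealth_async` instead of `fake_stealth` and use the returned dict as the `meta` argument of `scrapy.Request`.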

@hqtang33 Were you able to find a solution? I tried the changes you proposed here as well as your fork of the stealth plugin, but unfortunately even the "simple" removal of "Headless" from the user agent doesn't work.

kinoute, Mar 01 '23 15:03