scrapy-playwright
Support playwright_stealth
Integrates playwright_stealth and adds PLAYWRIGHT_STEALTH_ENABLED as an optional setting.
Attached bot test results.
PLAYWRIGHT_STEALTH_ENABLED = True
PLAYWRIGHT_STEALTH_ENABLED = False
Thank you very much for the contribution, but I don't want to include any third-party dependency unless it's really necessary. I've been thinking that one way to allow this functionality (and address #25 at the same time) would be to add a way to handle pages right after they are created (an idea I've already explored at https://github.com/scrapy-plugins/scrapy-playwright/issues/26#issuecomment-930182537). I'm imagining something like the following:
from scrapy import Spider, Request
from playwright.async_api import Page


async def new_page_handler(page: Page) -> None:
    await page.add_init_script("/path/to/script")
    # more stuff


class AwesomeSpider(Spider):
    def start_requests(self):
        yield Request(
            url="https://httpbin.org/get",
            meta={"playwright": True, "playwright_configure_page": new_page_handler},
        )
For the record, this should be possible after #128
It should be possible to include this with an optional pip dependency, e.g. scrapy-playwright[with_playwright_stealth], to avoid requiring the dependency while still including the integration in the distribution.
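A minimal sketch of what such an optional dependency could look like, assuming a setuptools-style extras mapping and a guarded import. The extra name is taken from the comment above; `EXTRAS_REQUIRE` and `load_stealth` are hypothetical names used only for illustration and are not part of scrapy-playwright.

```python
from typing import Callable, Optional

# Hypothetical packaging metadata: this mapping would be passed to
# setuptools' setup(extras_require=...), so that
# `pip install scrapy-playwright[with_playwright_stealth]` pulls in the extra package.
EXTRAS_REQUIRE = {
    "with_playwright_stealth": ["playwright-stealth"],
}


def load_stealth() -> Optional[Callable]:
    """Return stealth_async if the optional package provides it, else None."""
    try:
        # Imported lazily so the core package works without the extra installed.
        from playwright_stealth import stealth_async
    except ImportError:
        return None
    return stealth_async
```

Code that wants the stealth behaviour would call `load_stealth()` and skip the step when it returns `None`.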
That's true, but it would still require changes to the main handler in order to support the integration - that's what I want to avoid.
It's possible to integrate with this after v0.0.22, by using the playwright_page_init_callback request meta key:
import scrapy
from playwright_stealth import stealth_async


async def init_page(page, request):
    # Apply the stealth patches to the page right after it is created
    await stealth_async(page)


class StealthSpider(scrapy.Spider):
    name = "stealth"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )
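If more than one page setup step is needed (for example stealth plus an extra init script), the callbacks can be combined into one. The following is only a sketch: `chain_init_callbacks` is a hypothetical helper, not part of scrapy-playwright, and it assumes each callback takes the same `(page, request)` arguments as `init_page` above.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Signature assumed from the init_page example: (page, request) -> awaitable
InitCallback = Callable[[Any, Any], Awaitable[None]]


def chain_init_callbacks(*callbacks: InitCallback) -> InitCallback:
    """Build one init callback that awaits the given callbacks in order."""

    async def combined(page: Any, request: Any) -> None:
        for callback in callbacks:
            await callback(page, request)

    return combined
```

The result would be passed as the value of the `playwright_page_init_callback` meta key, e.g. `chain_init_callbacks(init_page, another_callback)`, where `another_callback` is a placeholder for any second coroutine.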
@hqtang33 Were you able to find a solution? I tried to include the changes you proposed here, and also your fork of the stealth plugin, but unfortunately even the "simple" removal of "Headless" from the user agent doesn't work.
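One possible way to approach the user-agent part on its own, independent of the stealth plugin, is to set the user agent explicitly on the browser context. This is a sketch under assumptions: the sample user-agent string is made up for illustration, `strip_headless` is a hypothetical helper, and it relies on the `PLAYWRIGHT_CONTEXTS` setting forwarding its keyword arguments (here `user_agent`) to Playwright's `browser.new_context`. It has not been verified against the setup described in this thread.

```python
# Illustrative value only; a real headless Chromium reports its own version.
SAMPLE_HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/96.0.4664.45 Safari/537.36"
)


def strip_headless(user_agent: str) -> str:
    """Replace the 'HeadlessChrome' product token with plain 'Chrome'."""
    return user_agent.replace("HeadlessChrome", "Chrome")


# Would go in the Scrapy project settings (settings.py or custom_settings).
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "user_agent": strip_headless(SAMPLE_HEADLESS_UA),
    },
}
```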