Is there a difference between using playwright and using scrapy-playwright?

Open sucream opened this issue 2 years ago • 3 comments

Hi. I think the results of using playwright and scrapy-playwright differ in some situations. With plain playwright, my code worked properly, but the same code in scrapy-playwright did not work for me. I think it may be related to the headers, but I couldn't find anything unusual. Can you explain this and tell me what I should do?

System spec

macOS 12.4
python 3.9.9
Scrapy==2.6.1
scrapy-playwright==0.0.18

This is my playwright code.

import asyncio
from playwright.async_api import async_playwright

async def run(playwright):
    browser = await playwright.chromium.launch(headless=False)
    page = await browser.new_page()

    await page.goto("https://www.daejeon.go.kr/drh/drhStoryDaejeonList.do?boardId=blog_0001&menuSeq=1479")

    await page.wait_for_load_state("networkidle")

    await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[3]")

    await page.wait_for_timeout(10000)

    await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[4]")

    await page.wait_for_timeout(10000)

    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

asyncio.run(main())

This is my scrapy-playwright code.

from scrapy import Spider, Request
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod

class Testpider(Spider):

    name = "test"

    custom_settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
            "LOG_LEVEL": "INFO",
            "PLAYWRIGHT_LAUNCH_OPTIONS":{
                "headless": False,
                "channel": "chrome",
            },
            "PLAYWRIGHT_CONTEXTS": {
                "default": {
                    "viewport": {
                        "width": 1920,
                        "height": 1080,
                    },
                }
            },
        }

    def start_requests(self):
        yield Request(
            url="https://www.daejeon.go.kr/drh/drhStoryDaejeonList.do?boardId=blog_0001&menuSeq=1479",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "networkidle"),
                ],
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        await page.click('''//html/body/div[4]/div/div/div/div[2]/div/div/a[3]''')


        await page.wait_for_timeout(10000)

        await page.click('''//html/body/div[4]/div/div/div/div[2]/div/div/a[4]''')

        await page.wait_for_timeout(10000)

        for key, val in response.request.headers.items():
            print(key, val)

        await page.close()

        return


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Testpider)
    process.start()

Thanks a lot :)

sucream avatar Jun 24 '22 13:06 sucream

Ideally there should be no difference if you're using PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None. Please be explicit about what result you are getting vs. what you would expect. Regardless of that, my first suggestion would be to use more specific selectors, based on attributes of the elements you want - //html/body/div[4]/div/div/div/div[2]/div/div/a[4] is a very unreliable one.
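As a rough illustration of the suggestion above, the sketch below targets a pagination link by its container and visible text instead of its position in the DOM. The `div.paging` container is taken from the CSS selector quoted later in this thread. The helper names and the assumption that each link's text is its page number are hypothetical and have not been checked against the site.

```python
from typing import Any


def page_link_selector(page_number: int) -> str:
    """Build a selector for a pagination link from its container and text."""
    # Assumption: links live inside `div.paging` and show the page number as text.
    # `:text-is()` is Playwright's exact-text pseudo-class for CSS selectors.
    return f'div.paging a:text-is("{page_number}")'


async def click_page_link(page: Any, page_number: int) -> None:
    """Click the pagination link for `page_number` on a Playwright page."""
    await page.click(page_link_selector(page_number))
```

Inside the spider's `parse` callback this could be called as `await click_page_link(page, 2)`. A selector of this kind keeps working if wrapper elements are added or removed, which a full positional XPath does not.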

elacuesta avatar Jun 24 '22 20:06 elacuesta

I apologize for not explaining. await page.click('''//html/body/div[4]/div/div/div/div[2]/div/div/a[3]''') clicks the <a> tag that links to the next page.

image

In playwright, clicking the XPath worked properly and switched to the next page, like this. image

But in scrapy-playwright, it clicked the next-page link and something seemed to happen (I saw an alert that said "You have accessed the wrong path" in Korean, and it closed by itself). So I set the "headless": False option and, after the operation, tried clicking the page link manually with the mouse in the browser, but I still couldn't go to the next page. I thought this site might dislike the Scrapy client and distinguish it by a header. If so, I wondered why playwright runs normally and scrapy-playwright does not.

What I wanted

  • If I click the <a> tag, go to the next page and get the next page's response, as in playwright.

What I got

  • scrapy-playwright clicks the <a> tag but stays on the same page, and the site treats it as abnormal access.
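One way to capture the alert text described above, rather than watching it flash by, might be a dialog handler like the sketch below. It relies on the `type`, `message` and `dismiss()` members of Playwright's Dialog object. The helper name `make_dialog_recorder` is hypothetical, and whether the site's alert contains anything useful is an open question.

```python
from typing import Any, Awaitable, Callable, List


def make_dialog_recorder(messages: List[str]) -> Callable[[Any], Awaitable[None]]:
    """Return a handler that stores each dialog's type and text, then dismisses it."""

    async def on_dialog(dialog: Any) -> None:
        # `type`, `message` and `dismiss()` are part of Playwright's Dialog API.
        messages.append(f"{dialog.type}: {dialog.message}")
        await dialog.dismiss()

    return on_dialog
```

In the spider's `parse` callback this could be attached before the first click with `page.on("dialog", make_dialog_recorder(seen))`, where `seen` is a list that is printed or logged after the clicks.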

sucream avatar Jun 25 '22 01:06 sucream

As you advised, I clicked using the #content>div.paging>div>div>a:nth-child(4) CSS selector, but the same thing happened.

I'll attach the video file. I hope this video will help you understand the situation.

  • Just playwright

https://user-images.githubusercontent.com/23046579/175754803-9ffe6db4-5e71-46a7-ab6e-ec9bc68be478.mov

  • scrapy-playwright

https://user-images.githubusercontent.com/23046579/175754827-70047c4f-0331-4dbc-88cd-d8d244504a5c.mov

sucream avatar Jun 25 '22 02:06 sucream

@elacuesta I'm seeing this behavior too: the page loads fine with playwright but does not load with scrapy-playwright. Is there a way I can send you a private URL to verify?

blacksteel1288 avatar Jan 07 '23 03:01 blacksteel1288

I don't do private consultancy, no. I verified this specific case was no longer an issue after #144, which fixed the fact that some requests had their method incorrectly overridden - that was the cause behind other open issues as well. You might want to try PLAYWRIGHT_PROCESS_REQUEST_HEADERS and/or debugging the headers that are sent to the target website. Also, please be aware of the recent conversation about cookies (#149).
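A minimal sketch of the header debugging mentioned above, assuming access to the Playwright page object (as with `playwright_include_page` in the spider earlier in this thread). It formats the method, URL and headers of each outgoing request so the plain Playwright run and the scrapy-playwright run can be compared. The helper name `describe_request` is hypothetical.

```python
from typing import Any


def describe_request(request: Any) -> str:
    """Format the method, URL and headers of a Playwright request for logging."""
    # `method`, `url` and `headers` are properties of Playwright's Request object.
    lines = [f"{request.method} {request.url}"]
    for name, value in sorted(request.headers.items()):
        lines.append(f"  {name}: {value}")
    return "\n".join(lines)
```

Attaching it with `page.on("request", lambda request: print(describe_request(request)))` in both scripts, before the clicks, would show whether the method or any header differs between the two runs.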

elacuesta avatar Jan 07 '23 21:01 elacuesta