scrapy-playwright
Is there a difference between using playwright and using scrapy-playwright?
Hi. I think the results of using playwright and scrapy-playwright are different in some situations. When I use just playwright, it works properly, but the same code in scrapy-playwright doesn't work for me. I think it may be related to the headers, but I couldn't find anything special. Can you explain this to me, and what should I do?
System spec
mac os 12.4
python 3.9.9
Scrapy==2.6.1
scrapy-playwright==0.0.18
This is playwright code.
import asyncio
from playwright.async_api import async_playwright

async def run(playwright):
    browser = await playwright.chromium.launch(headless=False)
    page = await browser.new_page()
    await page.goto("https://www.daejeon.go.kr/drh/drhStoryDaejeonList.do?boardId=blog_0001&menuSeq=1479")
    await page.wait_for_load_state("networkidle")
    await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[3]")
    await page.wait_for_timeout(10000)
    await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[4]")
    await page.wait_for_timeout(10000)
    await browser.close()

async def main():
    async with async_playwright() as playwright:
        await run(playwright)

asyncio.run(main())
This is my scrapy-playwright code.
from scrapy import Spider, Request
from scrapy.crawler import CrawlerProcess
from scrapy_playwright.page import PageMethod

class Testpider(Spider):
    name = "test"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_PROCESS_REQUEST_HEADERS": None,
        "LOG_LEVEL": "INFO",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "channel": "chrome",
        },
        "PLAYWRIGHT_CONTEXTS": {
            "default": {
                "viewport": {
                    "width": 1920,
                    "height": 1080,
                },
            }
        },
    }

    def start_requests(self):
        yield Request(
            url="https://www.daejeon.go.kr/drh/drhStoryDaejeonList.do?boardId=blog_0001&menuSeq=1479",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_load_state", "networkidle"),
                ],
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[3]")
        await page.wait_for_timeout(10000)
        await page.click("//html/body/div[4]/div/div/div/div[2]/div/div/a[4]")
        await page.wait_for_timeout(10000)
        for key, val in response.request.headers.items():
            print(key, val)
        await page.close()
        return

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Testpider)
    process.start()
Thanks a lot :)
Ideally there should be no difference if you're using PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None. Please be explicit about what result you are getting vs. what you would expect. Regardless of that, my first suggestion would be to use more specific selectors, based on attributes of the elements you want; //html/body/div[4]/div/div/div/div[2]/div/div/a[4] is a very unreliable one.
I apologize for not explaining. await page.click('''//html/body/div[4]/div/div/div/div[2]/div/div/a[3]''') is the next page's a tag.
In playwright, clicking that XPath worked properly and switched to the next page like this.
But in scrapy-playwright, it clicked the next page and then something seemed to happen (I saw an alert in Korean saying "You have accessed the wrong path", and the page closed itself). So I set the "headless": False option and tried clicking the page manually with the mouse in the browser after the operation, but I still couldn't go to the next page. I suspect this site rejects the Scrapy client and distinguishes it by a header. If so, I just wondered why plain Playwright runs normally while scrapy-playwright does not.
What I wanted
- When I click the a tag, go to the next page and get the next page's response, like in plain Playwright.
What I got
- scrapy-playwright clicks the a tag, but the page stays the same and the site recognizes it as an abnormal approach.
As you advised, I also tried clicking with the #content>div.paging>div>div>a:nth-child(4) CSS selector, but the same situation occurred.
I'll attach the video files. I hope these videos help you understand the situation.
- Just playwright
https://user-images.githubusercontent.com/23046579/175754803-9ffe6db4-5e71-46a7-ab6e-ec9bc68be478.mov
- scrapy-playwright
https://user-images.githubusercontent.com/23046579/175754827-70047c4f-0331-4dbc-88cd-d8d244504a5c.mov
@elacuesta I'm seeing this behavior also, where the page will load ok with playwright but will not load with scrapy-playwright. Is there a way I can send you a private URL to verify?
I don't do private consultancy, no. I verified this specific case was no longer an issue after #144, which fixed the fact that some requests had their method incorrectly overridden - that was the cause behind other open issues as well. You might want to try PLAYWRIGHT_PROCESS_REQUEST_HEADERS and/or debugging the headers that are sent to the target website. Also, please be aware of the recent conversation about cookies (#149).
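For the "debug the headers" part of the suggestion above: once you have dumped the headers each client sends (for example against a local echo endpoint, or by printing them from a page "request" event handler), a simple case-insensitive diff makes the discrepancy obvious. A minimal, browser-free sketch of that diffing step; the header values below are hypothetical, not taken from the real site:

```python
def diff_headers(a: dict, b: dict) -> dict:
    """Compare two header mappings case-insensitively and return the
    keys whose values differ (or that exist on only one side)."""
    la = {k.lower(): v for k, v in a.items()}
    lb = {k.lower(): v for k, v in b.items()}
    return {
        k: (la.get(k), lb.get(k))
        for k in la.keys() | lb.keys()
        if la.get(k) != lb.get(k)
    }

# Hypothetical header dumps from the two clients:
plain_playwright = {"User-Agent": "Mozilla/5.0 ...", "Accept-Language": "ko-KR"}
via_scrapy = {"User-Agent": "Scrapy/2.6.1", "Accept-Language": "ko-KR"}

print(diff_headers(plain_playwright, via_scrapy))
# {'user-agent': ('Mozilla/5.0 ...', 'Scrapy/2.6.1')}
```

Any key that shows up in the diff (User-Agent is a common one) is a candidate for why the site treats the two clients differently.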