scrapy-playwright
scrapy-playwright copied to clipboard
[question]: How to follow links using CrawlerSpider
I have had a hard time trying to follow links using the Scrapy Playwright to navigate a dynamic website.
want to write a crawl spider that will get all available odds information from https://oddsportal.com/ website. Some pages on the website are rendered using JavaScript, so I decided to use Scrapy Playwright.
Step 1.
I sent a request to the url of the website(https://oddsportal.com/results/) to return the content of the site(all links that I need to follow).
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_playwright.page import PageMethod
# from scrapy.utils.reactor import install_reactor
# install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
class OddsportalSpider(CrawlSpider):
name = 'oddsportal'
allowed_domains = ['oddsportal.com']
# start_urls = ['https://oddsportal.com/results/']
def start_requests(self):
url = 'https://oddsportal.com/results/'
yield scrapy.Request(url=url, meta= dict(playwright = True,
playwright_context = 1,
playwright_include_page = True,
playwright_page_methods = [
PageMethod('wait_for_selector', 'div#col-content')
]
))
Expected link from step one.
Step 2
Now I need to follow all the above links
def set_playwright_true(request, response):
request.meta["playwright"] = True
return request
rules = (
Rule(LinkExtractor(restrict_xpaths="//div[@id= 'archive-tables']//tbody/tr[@xsid=1]/td/a"), callback='parse_item',follow=False, process_request=set_playwright_true),
)
async def parse_item(self, response):
item = {}
item['text'] = response.url
yield item
When I run the above script, It doesn't get all the links from the https://oddsportal.com/results/, What am I doing wrong here. I believe I am not following the links rightly. The restrict_xpaths in the LinkExtractor is correct, because without Playwright I am able to extract the links but it does not yield the full content of the page.
All the links in the first image will take me to a page like this, which is rendered with JavaScript.
The CrawlSpider does not support async def
callbacks (they are not awaited, just invoked). Additionally, scrapy-playwright only requires async def
callbacks if you are performing operations with the Page object, which doesn't seem to be the case.
There's also no need to set playwright_include_page=True
. Apparently this is a common misconception. From the Receiving Page objects in callbacks section in the readme:
Caution Use this carefully, and only if you really need to do things with the Page object in the callback. (...)
Perhaps I wasn't clear before, what I meant was that according to the code you shared you don't need playwright_include_page=True
and thus also you don't to define your callback as async def
.
Removing those two should make your spider work as expected.
Closing due to lack of feedback.