
[question]: How to follow links using CrawlSpider

Open okoliechykwuka opened this issue 1 year ago • 2 comments

I have had a hard time trying to follow links with scrapy-playwright while navigating a dynamic website.

I want to write a crawl spider that will get all available odds information from the https://oddsportal.com/ website. Some pages on the website are rendered using JavaScript, so I decided to use scrapy-playwright.

Step 1.

I sent a request to the URL of the website (https://oddsportal.com/results/) to return the content of the site (all links that I need to follow).

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_playwright.page import PageMethod
# from scrapy.utils.reactor import install_reactor

# install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

class OddsportalSpider(CrawlSpider):
    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']
    # start_urls = ['https://oddsportal.com/results/']


    def start_requests(self):
        url = 'https://oddsportal.com/results/'
        yield scrapy.Request(
            url=url,
            meta=dict(
                playwright=True,
                playwright_context='1',  # context names are strings
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div#col-content'),
                ],
            ),
        )

Expected links from step one:

[screenshot: links extracted from the results page]

Step 2

Now I need to follow all the above links.


    def set_playwright_true(request, response):
        request.meta["playwright"] = True
        return request

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths="//div[@id='archive-tables']//tbody/tr[@xsid=1]/td/a"),
            callback='parse_item',
            follow=False,
            process_request=set_playwright_true,
        ),
    )

    async def parse_item(self, response):
        item = {}
        item['text'] = response.url
        yield item

When I run the above script, it doesn't get all the links from https://oddsportal.com/results/. What am I doing wrong here? I believe I am not following the links correctly. The restrict_xpaths in the LinkExtractor is correct, because without Playwright I am able to extract the links, but then it does not yield the full content of the page.

All the links in the first image will take me to a page like this, which is rendered with JavaScript.

[screenshot: a JavaScript-rendered odds page]

okoliechykwuka avatar Jul 26 '22 15:07 okoliechykwuka

The CrawlSpider does not support async def callbacks (they are not awaited, just invoked). Additionally, scrapy-playwright only requires async def callbacks if you are performing operations with the Page object, which doesn't seem to be the case.

There's also no need to set playwright_include_page=True. Apparently this is a common misconception. From the Receiving Page objects in callbacks section in the readme:

Caution Use this carefully, and only if you really need to do things with the Page object in the callback. (...)
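As an illustration (not from the thread): an `async def` callback that uses `yield` is an async generator function, so merely calling it does not run its body or produce any items — which is why CrawlSpider's plain, non-awaiting invocation of the callback silently yields nothing.

```python
import inspect

async def parse_item(response):
    yield {"text": response.url}

# Calling the callback without awaiting/async-iterating it only creates
# an async generator object; the function body never executes.
result = parse_item(None)
print(inspect.isasyncgen(result))  # → True
```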

elacuesta avatar Jul 26 '22 19:07 elacuesta

Perhaps I wasn't clear before: what I meant was that, according to the code you shared, you don't need playwright_include_page=True, and thus you also don't need to define your callback as async def. Removing those two should make your spider work as expected.

elacuesta avatar Aug 15 '22 16:08 elacuesta

Closing due to lack of feedback.

elacuesta avatar Oct 03 '22 19:10 elacuesta