
scrapy-splash recursive crawl using CrawlSpider not working

Open · dijadev opened this issue on Nov 10, 2016 · 36 comments

Hi!

I have integrated scrapy-splash into my CrawlSpider via the process_request hook of my rules, like this:

    def process_request(self, request):
        request.meta['splash'] = {
            'args': {
                # set rendering arguments here
                'html': 1,
            }
        }
        return request

The problem is that the crawl only renders URLs at the first depth. I also wonder how I can get the response even when it has a bad HTTP status code or is a redirect.
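A minimal sketch of how such a hook is typically wired into a CrawlSpider rule (an illustration, not the original spider), including the standard Scrapy meta keys handle_httpstatus_all and dont_redirect, which let error and redirect responses reach a callback:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ExampleSpider(CrawlSpider):
        # hypothetical spider for illustration
        name = 'example'
        start_urls = ['https://example.com']

        rules = (
            Rule(LinkExtractor(), callback='parse_item',
                 process_request='process_request', follow=True),
        )

        def process_request(self, request):
            request.meta['splash'] = {'args': {'html': 1}}
            # standard Scrapy meta keys: deliver non-2xx responses to the
            # callback and keep redirects from being followed silently
            request.meta['handle_httpstatus_all'] = True
            request.meta['dont_redirect'] = True
            return request

        def parse_item(self, response):
            pass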

Thanks in advance,

dijadev avatar Nov 10 '16 17:11 dijadev

I also have this issue.

NORMAL REQUEST - it follows the rules with follow=True:

    yield scrapy.Request(url, callback=self.parse, dont_filter=True,
                         errback=self.errback_httpbin)

USING SPLASH - it only visits the first URL:

    yield scrapy.Request(url, callback=self.parse, dont_filter=True,
                         errback=self.errback_httpbin,
                         meta={'splash': {'endpoint': 'render.html',
                                          'args': {'wait': 0.5}}})

wattsin avatar Jan 23 '17 19:01 wattsin

Has someone found a solution?

dijadev avatar Jan 27 '17 18:01 dijadev

I have not, unfortunately.


wattsin avatar Jan 27 '17 18:01 wattsin

I have the same problem, any solution?

amirj avatar Feb 13 '17 23:02 amirj

Negative.

wattsin avatar Feb 14 '17 13:02 wattsin

+1 over here. Encountering the same issue as described by @wattsin.

brianherbert avatar Apr 03 '17 19:04 brianherbert

I also ran into this issue today and found that CrawlSpider does a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, responses generated by Splash are SplashTextResponse or SplashJsonResponse. That check means Splash responses never produce any requests to follow.

dwj1324 avatar Jun 08 '17 05:06 dwj1324

+1

komuher avatar Jul 21 '17 10:07 komuher

+1

ghost avatar Aug 09 '17 09:08 ghost

@dwj1324

I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse):. That code was never reached when SplashRequest was used instead of scrapy.Request.

What worked for me is to add this to the callback parsing function:

    def parse_item(self, response):
        """Parse response into an item and also create new requests."""

        page = RescrapItem()
        ...
        yield page

        if isinstance(response, (HtmlResponse, SplashTextResponse)):
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [lnk for lnk in rule.link_extractor.extract_links(response)
                         if lnk not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                for link in links:
                    seen.add(link)
                    r = SplashRequest(url=link.url,
                                      callback=self._response_downloaded,
                                      args=SPLASH_RENDER_ARGS)
                    r.meta.update(rule=rule, link_text=link.text)
                    yield rule.process_request(r)

hieu-n avatar Aug 31 '17 01:08 hieu-n

+1, any update for this issue?

NingLu avatar Oct 24 '17 10:10 NingLu

@hieu-n I used the code you pasted here and changed the SplashRequest to a plain Request, since I need to set headers, but it doesn't work; the spider still only crawls first-depth content. Any suggestion would be appreciated.

(screenshot of the spider output omitted)
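On the headers point: Splash's render endpoints accept a headers argument in JSON POST requests, which scrapy-splash uses by default, so something along these lines might work (a hedged sketch, not verified against this spider):

    # sketch: forward custom headers through Splash's 'headers' argument
    # (render.html supports it for application/json POST requests)
    request.meta['splash'] = {
        'args': {
            'html': 1,
            'headers': {'User-Agent': 'my-custom-agent'},
        },
    }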

NingLu avatar Oct 24 '17 10:10 NingLu

@NingLu I haven't touched Scrapy for a while. In your case, what I would do is set a few breakpoints and step through your code and Scrapy's code. Good luck!

hieu-n avatar Oct 25 '17 00:10 hieu-n

+1 any updates here?

Goles avatar Jan 16 '18 06:01 Goles

Hello everyone! As @dwj1324 said, CrawlSpider does a response type check in its _requests_to_follow function, so I've just overridden that function to stop it discarding SplashJsonResponse(s). (The code was posted as a screenshot; a reconstruction follows.)
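Reconstructed from the equivalent overrides posted later in this thread (the Splash response classes are importable from scrapy_splash):

    from scrapy.http import HtmlResponse
    from scrapy_splash import SplashJsonResponse, SplashTextResponse

    def _requests_to_follow(self, response):
        # widen the type check so Splash responses are not discarded
        if not isinstance(response, (HtmlResponse, SplashJsonResponse,
                                     SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)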

Hope this helps!

dijadev avatar Jan 16 '18 09:01 dijadev

Having the same issue. Have overridden _requests_to_follow as stated by @dwj1324 and @dijadev.

As soon as I start using splash by adding the following code to my spider:

    def start_requests(self):
        for url in self.start_urls:
            print('->', url)
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

it no longer calls _requests_to_follow. Scrapy follows links again when I comment that function out.
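A likely cause (an inference from how CrawlSpider works, not confirmed in this thread): _requests_to_follow only runs via CrawlSpider's built-in parse callback, so pointing the start requests directly at parse_item bypasses the rule machinery. Leaving the callback at its default keeps it in play:

    def start_requests(self):
        # omit the callback: it then defaults to CrawlSpider.parse,
        # which is what invokes _requests_to_follow on each response
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 0.5})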

tf42src avatar Feb 06 '18 05:02 tf42src

Hi, I have found a workaround which works for me. Instead of using a plain Scrapy request:

    yield scrapy.Request(page_url, self.parse_page)

simply prepend the Splash render endpoint to the URL:

    yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page)

The localhost port may depend on how you built the Splash Docker container.
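One caveat worth adding here (not from the original comment): if page_url carries its own query string, it needs URL-encoding before being embedded, e.g. with the standard library:

    from urllib.parse import quote

    # encode the target URL so its own query string survives being
    # embedded as the ?url= parameter of render.html
    splash_url = 'http://localhost:8050/render.html?url=' + quote(page_url, safe='')
    yield scrapy.Request(splash_url, self.parse_page)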

VictorXunS avatar Apr 05 '18 14:04 VictorXunS


@VictorXunS this is not working for me, could you share all your CrawlSpider code?

reg3x avatar Jan 07 '19 20:01 reg3x

I also had problems combining CrawlSpider with SplashRequest and Crawlera. Overriding the _requests_to_follow function and removing the whole isinstance check worked for me. Thanks @dijadev and @hieu-n for the suggestions.

    def _requests_to_follow(self, response):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

victor-papa avatar Feb 18 '19 19:02 victor-papa

I am no expert, but Scrapy's link extractors have their own duplicate filtering, don't they? (you use not seen) See the unique (boolean) parameter of scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor, "whether duplicate filtering should be applied to extracted links": http://doc.scrapy.org/en/latest/topics/link-extractors.html



JavierRuano avatar Feb 18 '19 19:02 JavierRuano

Hi @Nick-Verdegem, thank you for sharing. My CrawlSpider is still not working with your solution; do you use start_requests?

XamHans avatar Feb 19 '19 12:02 XamHans

So I encountered this issue and solved it by overriding the type check as suggested:

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse)):
            return
        ...

But you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to your scrapy.Request's meta. The scrapy.Request returned by _requests_to_follow carries attributes in its meta, such as the index of the rule that generated it, which CrawlSpider uses for its logic. So don't generate a totally different request with SplashRequest in your request wrapper; just add splash to the already-built request, like so:

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

Then add it to your Rule with process_request="use_splash"; _requests_to_follow applies process_request to every built request. That's what worked for my CrawlSpider. Hope that helps!

MontaLabidi avatar Mar 02 '19 19:03 MontaLabidi

I use scrapy-splash together with scrapy-redis. RedisCrawlSpider can work, but these methods need to be rewritten:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute',
                                dont_filter=True,
                                args={'url': url, 'wait': 5, 'lua_source': default_script})

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        # the 'meta' parameter is required!
        r = SplashRequest(url=link.url, callback=self._response_downloaded,
                          meta={'rule': rule, 'link_text': link.text},
                          args={'wait': 5, 'url': link.url, 'lua_source': default_script})
        # this may be redundant given the meta above
        r.meta.update(rule=rule, link_text=link.text)
        return r

Some parameters need to be adjusted for your own setup.
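default_script is not shown above; a minimal Lua script for the execute endpoint, adapted from the standard Splash documentation example, might look like this (a hypothetical stand-in, not the author's actual script):

    default_script = """
    function main(splash, args)
        -- load the page, wait for rendering, return the HTML
        splash:go(args.url)
        splash:wait(args.wait)
        return splash:html()
    end
    """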

nciefeiniu avatar Mar 06 '19 16:03 nciefeiniu

@MontaLabidi Your solution worked for me.

This is how my code looks:


class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.
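For completeness: a spider like this also assumes the standard scrapy-splash wiring in settings.py, per the scrapy-splash README (not shown in the comment above):

    # settings.py
    SPLASH_URL = 'http://localhost:8050'

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'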

sp-philippe-oger avatar Apr 26 '19 16:04 sp-philippe-oger

@sp-philippe-oger could you please show the whole file? In my case the crawl spider won't call the redefined _requests_to_follow and as a consequence still stops after the first page...

digitaldust avatar May 02 '19 13:05 digitaldust

@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work.

sp-philippe-oger avatar May 10 '19 10:05 sp-philippe-oger

@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks!

digitaldust avatar May 10 '19 11:05 digitaldust

Anyone get this to work while running a Lua script for each pagination?

MSDuncan82 avatar Oct 01 '19 19:10 MSDuncan82

@nciefeiniu hi... would you please give more information about integrating scrapy-redis with Splash? I mean, how do you send your URLs from Redis to Splash?

davisbra avatar Oct 28 '19 08:10 davisbra

(quoting @sp-philippe-oger's solution above)

I use Python 3, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong?

zhaicongrong avatar May 10 '20 13:05 zhaicongrong
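That error is most likely a Scrapy version difference rather than a scrapy-splash problem: since Scrapy 2.0, Rule.process_request receives the originating response as a second argument, and the default _identity_process_request therefore expects two arguments, while an override copied from older Scrapy calls rule.process_request(r) with only one. A sketch of the adjustment, assuming Scrapy 2.0+:

    # pass the response through to process_request ...
    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r, response)

    # ... and accept it in the hook
    def use_splash(self, request, response):
        request.meta.update(splash={
            'args': {'wait': 1},
            'endpoint': 'render.html',
        })
        return request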