scrapy-splash recursive crawl using CrawlSpider not working
Hi!
I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:
def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
The problem is that the crawl only renders URLs at the first depth. I also wonder how I can get the response even when it has a bad HTTP status code or is a redirect.
Thanks in advance,
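On the second part of the question: responses with error status codes or redirects are normally filtered out by Scrapy's HttpError and Redirect middlewares rather than by scrapy-splash itself. A minimal sketch of how the same process_request hook could relax that per request, using the standard Scrapy meta keys handle_httpstatus_all and dont_redirect:

def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            'html': 1,
        }
    }
    # let 4xx/5xx responses reach the callback instead of being
    # dropped by HttpErrorMiddleware
    request.meta['handle_httpstatus_all'] = True
    # keep the redirect response itself instead of letting
    # RedirectMiddleware follow it
    request.meta['dont_redirect'] = True
    return request

Note that when the page is rendered through Splash, redirects may already be followed inside Splash itself, so this only covers the Scrapy side.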
I also have this issue.
NORMAL REQUEST - it follows the rules with follow=True:
yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin)
USING SPLASH - it only visits the first URL:
yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin, meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}})
Has anyone found a solution?
I have not, unfortunately.
I have the same problem, any solution?
Negative.
+1 over here. Encountering the same issue as described by @wattsin.
I also got the same issue today and found that CrawlSpider does a response type check in its _requests_to_follow function:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...
However, responses generated by Splash are SplashTextResponse or SplashJsonResponse, so that check means Splash responses never produce any requests to follow.
+1
+1
@dwj1324
I tried to debug my spider with PyCharm and set a breakpoint at the line if not isinstance(response, HtmlResponse): in _requests_to_follow. That code was never reached when SplashRequest was used instead of scrapy.Request.
What worked for me is to add this to the callback parsing function:
def parse_item(self, response):
    """Parse response into item also create new requests."""
    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
+1, any update on this issue?
@hieu-n I used the code you pasted here and changed the SplashRequest to a plain Request since I need to use headers, but it doesn't work: the spider still only crawls the first-depth content. Any suggestion will be appreciated.

@NingLu I haven't touched scrapy for a while. In your case, what I would do is set a few breakpoints and step through your code and Scrapy's code. Good luck!
+1 any updates here?
Hello everyone!
As @dwj1324 said, CrawlSpider does a response type check in its _requests_to_follow function.
So I've just overridden this function so that SplashJsonResponse(s) are no longer skipped.
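A minimal sketch of such an override, based on the stock CrawlSpider._requests_to_follow with only the type check relaxed (assuming the Splash response classes are imported from scrapy_splash):

from scrapy import Request
from scrapy.http import HtmlResponse
from scrapy_splash import SplashJsonResponse, SplashTextResponse

# inside the CrawlSpider subclass
def _requests_to_follow(self, response):
    # also accept Splash responses so their links get extracted
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)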
Hope this helps!
Having the same issue. Have overridden _requests_to_follow as stated by @dwj1324 and @dijadev.
As soon as I start using splash by adding the following code to my spider:
def start_requests(self):
    for url in self.start_urls:
        print('->', url)
        yield SplashRequest(url, self.parse_item, args={'wait': 0.5})
it does not call _requests_to_follow anymore. Scrapy follows links again when that function is commented out.
Hi, I have found a workaround which works for me. Instead of using a plain Scrapy request:
yield scrapy.Request(page_url, self.parse_page)
simply prepend the Splash endpoint to the URL:
yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page)
The localhost port may depend on how you built the Splash Docker container.
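One caveat: if page_url itself contains query parameters, it should be URL-encoded before being glued onto the render.html endpoint. A small sketch (make_splash_url is a hypothetical helper; adjust the host/port to your Splash instance):

from urllib.parse import urlencode

SPLASH_RENDER = 'http://localhost:8050/render.html'  # adjust to your Splash instance

def make_splash_url(page_url, wait=0.5):
    # encode the target URL (and any extra Splash arguments) safely
    return SPLASH_RENDER + '?' + urlencode({'url': page_url, 'wait': wait})

# usage in a spider callback:
# yield scrapy.Request(make_splash_url(page_url), self.parse_page)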
@VictorXunS this is not working for me, could you share all your CrawlSpider code?
Also had problems combining CrawlSpider with SplashRequests and Crawlera. Overriding the _requests_to_follow function and removing the whole if isinstance condition worked for me. Thanks @dijadev and @hieu-n for the suggestions.
def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)

def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
I am not an expert, but Scrapy's link extractors have their own duplicate filter, don't they? (You use not seen.)
From http://doc.scrapy.org/en/latest/topics/link-extractors.html, class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor:
unique (boolean): whether duplicate filtering should be applied to extracted links.
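As an illustration, that extractor-level filtering can be requested directly (the XPath here is just a placeholder):

from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# unique=True (the default) makes the extractor itself drop duplicate
# links within a single response
extractor = LxmlLinkExtractor(restrict_xpaths='//div/a', unique=True)
# links = extractor.extract_links(response)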
Hi @Nick-Verdegem, thank you for sharing. My CrawlSpider is still not working with your solution, do you use start_requests?
So I encountered this issue and solved it by overriding the type check as suggested:
def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashTextResponse)):
        return
    ....
However, you also have to avoid using SplashRequest in your process_request method to create the new Splash requests; just add splash to the meta of the scrapy.Request. The scrapy.Request returned from _requests_to_follow already carries attributes in its meta, such as the index of the rule that generated it, which the spider uses for its own logic. So you don't want to generate a totally different request with SplashRequest in your request wrapper; just add splash to the already built request, like so:
def use_splash(self, request):
    request.meta.update(splash={
        'args': {
            'wait': 1,
        },
        'endpoint': 'render.html',
    })
    return request
and add it to your Rule:
process_request="use_splash"
The _requests_to_follow method will apply process_request to every built request; that's what worked for my CrawlSpider.
Hope that helps!
I use scrapy-splash together with scrapy-redis, and RedisCrawlSpider works this way. You need to override the following:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
            'url': url, 'wait': 5, 'lua_source': default_script
        })

def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)

def _build_request(self, rule, link):
    # the 'meta' parameter is required!
    r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
                      args={'wait': 5, 'url': link.url, 'lua_source': default_script})
    # Maybe you can delete this line.
    r.meta.update(rule=rule, link_text=link.text)
    return r
Some parameters need to be adjusted for your own setup.
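default_script is whatever Lua source you pass to the execute endpoint; purely as a sketch (not necessarily the script used above), a minimal one that loads the page, waits, and returns the HTML could look like:

# a hypothetical minimal script, not necessarily the default_script used above
default_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
"""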
@MontaLabidi Your solution worked for me.
This is how my code looks:
from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashJsonResponse, SplashTextResponse


class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass
This works perfectly for me.
@sp-philippe-oger could you please show the whole file? In my case the crawl spider won't call the redefined _requests_to_follow and as a consequence still stops after the first page...
@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work.
@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks!
Anyone get this to work while running a Lua script for each pagination?
@nciefeiniu Hi... would you please give more information about integrating scrapy-redis with Splash? I mean, how do you send your URLs from Redis to Splash?
@sp-philippe-oger I use Python 3 with the code above, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong?
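That error is typical of newer Scrapy releases, where the default Rule.process_request (_identity_process_request) takes both the request and the response, so the one-argument rule.process_request(r) call in the copied _requests_to_follow no longer matches. A sketch of the adjustment, assuming a Scrapy version (2.0+) with the two-argument signature:

from scrapy.http import HtmlResponse
from scrapy_splash import SplashJsonResponse, SplashTextResponse

# inside the CrawlSpider subclass, assuming rule.process_request
# expects (request, response)
def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            request = self._build_request(n, link)
            # pass the originating response along as the second argument
            yield rule.process_request(request, response)

def use_splash(self, request, response):
    # custom process_request callbacks receive the response too
    request.meta.update(splash={
        'args': {'wait': 1},
        'endpoint': 'render.html',
    })
    return request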