scrapy-splash
Add an option to send requests to Splash by default
We could create a middleware which adds a 'splash' meta key to all requests, or to all requests matching some pattern. It could also decode the results to make the whole thing more or less transparent.
Is it a good idea? Or are explicit requests enough?
I would imagine this could work by just adding a spider attribute splash_enabled=True. If the attribute is True, every request should get the 'splash' meta key. Or we could even go without the meta key and just check the spider attribute splash_enabled: if it is True, use splash for requests.
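A minimal sketch of the splash_enabled idea, to make the proposal concrete. The class name SplashByDefaultMiddleware and the default options {'args': {}} are hypothetical; only the process_request(request, spider) hook is taken from Scrapy's downloader middleware interface. The spider and request objects are plain stand-ins so the sketch runs without Scrapy installed, and a real version would presumably need to run before the middleware that handles the 'splash' meta key.

```python
from types import SimpleNamespace


class SplashByDefaultMiddleware:
    """Hypothetical middleware: tags requests with a 'splash' meta key."""

    def process_request(self, request, spider):
        # Only act when the spider opts in via the proposed attribute.
        if not getattr(spider, "splash_enabled", False):
            return None
        # Keep any 'splash' value the request already carries;
        # otherwise fall back to empty rendering args (an assumption).
        request.meta.setdefault("splash", {"args": {}})
        return None  # None tells Scrapy to continue processing the request


# Stand-ins for a Scrapy spider and request, to keep the sketch dependency-free.
spider = SimpleNamespace(splash_enabled=True)
request = SimpleNamespace(url="http://example.com", meta={})
SplashByDefaultMiddleware().process_request(request, spider)
print(request.meta)
```

With this shape, a request that sets its own 'splash' value is left alone, and spiders without the attribute behave exactly as before.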
From my experience, it's convenient to be able to enable splash for an entire spider. Maybe it would be enough to add support for a 'splash' spider attribute holding the splash options? For example:
    import scrapy
    import scrapyjs
    from scrapy import Request

    class MySpider(scrapy.Spider):
        start_urls = ["http://example.com", "http://example.com/foo"]
        splash = {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # overrides SPLASH_URL
            'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
        }

        def start_requests(self):
            # parse_result is the spider's own callback (not shown)
            for url in self.start_urls:
                yield Request(url, self.parse_result, meta={
                    # enable if True
                    # disable splash if bool(request.meta.get('splash')) is False
                    'splash': True,
                })
@pawelmhm if you just set Spider.splash = True, it's not clear how to pass splash options per spider.
Right @chekunkov, if we only add a splash_enabled attribute we would also need splash_options (so two attributes would be needed). So maybe it would be easier to have just splash_options, and setting it would enable the middleware. Sounds good.
I like it, but I'm not sure this
yield Request(url, self.parse_result, meta={'splash': True})
is better than this:
yield Request(url, self.parse_result, meta={'splash': self.splash_options})
@kmike good point.
The main idea of meta={'splash': True/False} was the ability to disable splash per request if it was enabled for the entire spider. But I looked at #15 now: this is what 'dont_proxy' can be used for, so there is no need for meta={'splash': True}. If a developer wants to use the default spider config, he doesn't use the 'splash' key. If a developer doesn't want to use splash for some request, he can use meta={'dont_proxy': True}. If a developer wants to use a different splash config, he can use meta={'splash': {'args': {}, ...}} as usual. wdyt?
Also, the question is: do we need the 'meta' key for all splash requests? I think we don't need it if the splash middleware is enabled via a spider attribute.
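The rules proposed in this comment could be sketched roughly as follows. The helper name resolve_splash_options is hypothetical, the spider attribute is assumed to be called splash_options as suggested earlier in the thread, and letting 'dont_proxy' win over an explicit 'splash' key is an assumption the thread does not settle.

```python
from types import SimpleNamespace


def resolve_splash_options(spider, meta):
    """Return the Splash options for a request, or None to skip Splash."""
    # Per-request opt-out, reusing the 'dont_proxy' key discussed above.
    if meta.get("dont_proxy"):
        return None
    # A per-request 'splash' config overrides the spider default.
    if "splash" in meta:
        return meta["splash"]
    # No 'splash' key: fall back to the spider-wide config, if any.
    return getattr(spider, "splash_options", None)


# Stand-in spider object so the sketch runs without Scrapy installed.
spider = SimpleNamespace(splash_options={"args": {"html": 1}})

default = resolve_splash_options(spider, {})
skipped = resolve_splash_options(spider, {"dont_proxy": True})
custom = resolve_splash_options(spider, {"splash": {"args": {"png": 1}}})
```

Under these rules no 'splash' meta key is needed for the common case, which is the point of the question above: the spider attribute alone is enough to route a request through Splash.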
@pawelmhm please check comment above :)
If developer wants to use default spider config - he doesn't use 'splash' key.