scrapy-splash

Add an option to send requests to Splash by default

Open kmike opened this issue 9 years ago • 8 comments

We could create a middleware which adds 'splash' meta key to all requests, or to all requests matching some pattern. It could also decode the results to make the whole thing more or less transparent.
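As a rough illustration of the proposal (not code from scrapy-splash itself), the "add the key to all requests, or to those matching a pattern" logic could look like the sketch below. The function name `with_default_splash` and its parameters are hypothetical; a real middleware would apply something like this to each request's `meta` from its request-processing hook.

```python
import re


def with_default_splash(url, meta, defaults, pattern=None):
    """Return a copy of `meta` with a 'splash' key added when the URL qualifies.

    Hypothetical helper: `defaults` holds the Splash options to apply,
    `pattern` is an optional regex restricting which URLs get them.
    """
    new_meta = dict(meta)
    if 'splash' in new_meta:
        # Explicit per-request settings always win over the defaults.
        return new_meta
    if pattern is not None and not re.search(pattern, url):
        # URL does not match the pattern: leave the request untouched.
        return new_meta
    new_meta['splash'] = dict(defaults)
    return new_meta
```

With `pattern=None` every request without an explicit 'splash' key receives the defaults; with a pattern, only matching URLs do.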

Is it a good idea? Or are explicit requests enough?

kmike avatar Feb 27 '15 21:02 kmike

I would imagine this could work by just adding a spider attribute splash_enabled=True. If the attribute is True, every request gets the 'splash' meta key. Or we could even go without the meta key and just check the spider attribute: if splash_enabled is True, use Splash for requests.

pawelmhm avatar Apr 10 '15 11:04 pawelmhm

From my experience, it's convenient to be able to enable Splash for an entire spider. Maybe adding support for a 'splash' spider attribute holding the Splash options would be enough? For example:

import scrapy
import scrapyjs


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]
    splash = {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
    }


yield Request(url, self.parse_result, meta={
    # enable Splash if True;
    # disable Splash if bool(request.meta['splash']) is False
    'splash': True,
})

chekunkov avatar Apr 10 '15 11:04 chekunkov

@pawelmhm if you just set Spider.splash = True, it's not clear how to pass Splash options per spider.

chekunkov avatar Apr 10 '15 11:04 chekunkov

Right, @chekunkov: if we only add a splash_enabled attribute we would also need splash_options, so two attributes. It would be easier to have just splash_options, and setting it would enable the middleware. Sounds good.

pawelmhm avatar Apr 10 '15 11:04 pawelmhm

I like it, but I'm not sure this

yield Request(url, self.parse_result, meta={'splash': True})

is better than this:

yield Request(url, self.parse_result, meta={'splash': self.splash_options})

kmike avatar Apr 10 '15 11:04 kmike

@kmike good point

The main idea of meta={'splash': True/False} was the ability to disable Splash per request when it is enabled for the entire spider. But I looked at #15 now, and this is what 'dont_proxy' can be used for, so there is no need for meta={'splash': True}. If the developer wants to use the default spider config, he doesn't use the 'splash' key. If the developer doesn't want to use Splash for some request, he can use meta={'dont_proxy': True}. If the developer wants a different Splash config, he can use meta={'splash': {'args': {}, ...}} as usual. wdyt?
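To make the three cases concrete, here is a minimal sketch of the resolution order, assuming the spider-level options live in a dict and 'dont_proxy' is the opt-out key as suggested above. `resolve_splash` is a hypothetical name for illustration, not an existing function in the library.

```python
def resolve_splash(spider_defaults, meta):
    """Return the Splash options to use for a request, or None to skip Splash.

    Hypothetical helper: `spider_defaults` is the spider-level config
    (or None if the spider defines none), `meta` is the request's meta dict.
    """
    if meta.get('dont_proxy'):
        # Request explicitly opts out of Splash.
        return None
    if 'splash' in meta:
        # Request carries its own Splash config, which overrides the spider's.
        return meta['splash']
    # No 'splash' key: fall back to the spider-level defaults.
    return spider_defaults
```

Checking the opt-out first means a request with both keys set skips Splash; that ordering is a design choice of this sketch, not something the thread settles.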

chekunkov avatar Apr 10 '15 11:04 chekunkov

Another question: do we need the 'meta' key for all Splash requests? I think we don't need it if the Splash middleware is enabled via a spider attribute.

pawelmhm avatar Apr 10 '15 11:04 pawelmhm

@pawelmhm please check comment above :)

If the developer wants to use the default spider config, he doesn't use the 'splash' key.

chekunkov avatar Apr 10 '15 12:04 chekunkov