splash works well with scrapy shell but not with scrapy-splash

Open • BravoNatalie opened this issue 4 years ago • 1 comment

I am using scrapy-splash to scrape a YouTube video page. However, the response object appears to be incomplete when I run my spider, whereas I get a complete result when I use the Scrapy shell.

I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and compared it with the output of view(response) inside the inspect_response(response, self) call. They are different: the response used by the spider is incomplete.

scrapy shell:

scrapy shell 'http://localhost:8050/render.html?url=https://www.youtube.com/watch?v=HOfTrhmIXIM&wait=2.0'
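
For comparison, the same render.html URL can be built with the target URL percent-encoded, so that its own ?v= query string cannot be mistaken for Splash parameters. This is a minimal sketch using only the standard library. The helper name splash_render_url is hypothetical, and the Splash address is assumed to be the localhost:8050 instance used above:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def splash_render_url(target_url, wait=2.0, splash_base="http://localhost:8050"):
    """Build a render.html URL with the target URL safely percent-encoded.

    splash_render_url is a hypothetical helper, not part of Scrapy or Splash.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


shell_url = splash_render_url("https://www.youtube.com/watch?v=HOfTrhmIXIM")
print(shell_url)

# Round-trip check: decoding the query recovers the original target URL.
params = parse_qs(urlsplit(shell_url).query)
print(params["url"][0])
```

The resulting string can be passed to scrapy shell in place of the hand-written URL.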

Scrapy shell correct result:

response.xpath('//*[@id="container"]/h1/yt-formatted-string/text()').extract_first(default='')
Out[1]: 'Scraping, analyzing youtube channel data with python'

My spider using the scrapy-splash:

/videoSpider.py

import scrapy
from scrapy_splash import SplashRequest
from youtube_scrapy.items import YoutubeVideoItem 
from scrapy.shell import inspect_response

class videoSpider(scrapy.Spider):
    name = "videoSpider"
    start_urls = ["https://www.youtube.com/watch?v=HOfTrhmIXIM"]


    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait':2.0})

    def parse(self, response):
        # inspect_response(response, self)
        # print(response.meta['splash'])
        print(response.real_url)
        item = YoutubeVideoItem()
        item['keywords'] = response.xpath('/html/head/meta[@name="keywords"]/@content').extract_first(default='')
        item['title'] = response.xpath('//*[@id="container"]/h1/yt-formatted-string/text()').extract_first(default='')
        item['visualizations'] = response.xpath('//*[@id="count"]/yt-view-count-renderer/span[1]/text()').extract_first(default='')
        item['publication_data'] = response.xpath('//*[@id="date"]/yt-formatted-string/text()').extract_first(default='')
        item['likes'] = response.xpath('//*[@id="text"]/text()')[2].extract()
        item['dislikes'] = response.xpath('//*[@id="text"]/text()')[3].extract()
        item['description'] = response.xpath('//*[@id="description"]/yt-formatted-string/text()').extract_first(default='')
        item['channel_name'] = response.xpath('//*[@id="text"]/a/text()').extract_first(default='')
        item['channel_subscribers'] = response.xpath('//*[@id="owner-sub-count"]/text()').extract_first(default='')
        
        yield item

/settings.py

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# url of splash server
SPLASH_URL = 'http://localhost:8050'

# and some splash variables
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Results using the spider:

{'channel_name': '',
 'channel_subscribers': '',
 'description': '',
 'keywords': 'python, data science, data, data analysis, web scraping, '
             'scraping',
 'publication_data': '',
 'title': '',
 'visualizations': ''}

I can't make sense of why the same XPath expressions return nothing when run from the spider.

BravoNatalie, May 16 '20 23:05

@BravoNatalie try setting the endpoint argument of your SplashRequest to execute (instead of the default render.html). That fixed me up when I was facing similar issues :)
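
A rough sketch of what that suggestion could look like, assuming a small Lua script that loads the page, waits, and returns the rendered HTML. The script and the helper names (execute_request_kwargs, make_execute_request) are illustrative assumptions, not code from the original spider:

```python
# Lua script for Splash's execute endpoint (an assumed, minimal script).
# Returning a table with an "html" key lets scrapy-splash expose the rendered
# page through the usual response selectors.
LUA_SOURCE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html(), url = splash:url()}
end
"""


def execute_request_kwargs(url, callback, wait=2.0):
    """Keyword arguments for a SplashRequest that targets the execute endpoint."""
    return {
        "url": url,
        "callback": callback,
        "endpoint": "execute",
        "args": {"lua_source": LUA_SOURCE, "wait": wait},
    }


def make_execute_request(url, callback, wait=2.0):
    """Create the SplashRequest (imported lazily; requires scrapy-splash)."""
    from scrapy_splash import SplashRequest

    return SplashRequest(**execute_request_kwargs(url, callback, wait))
```

Inside start_requests, the spider above would then yield make_execute_request(url, self.parse) instead of the render.html request. Whether this resolves the missing fields on a given page is not guaranteed.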

pearsonhenri, Nov 02 '20 15:11