scrapy-splash
splash works well with scrapy shell but not with scrapy-splash
I am using scrapy-splash to scrape a YouTube video page. However, the response object seems to be incomplete when I use my spider, while I get a complete result when I use the scrapy shell.
I also downloaded a copy of the HTML response from the Splash GUI (http://localhost:8050) and compared it with the output of view(response) inside inspect_response(response, self). They are different: the one used by the spider is incomplete.
scrapy shell:
scrapy shell 'http://localhost:8050/render.html?url=https://www.youtube.com/watch?v=HOfTrhmIXIM&wait=2.0'
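Note that the command above passes the target URL unencoded. It happens to work here because the YouTube URL carries a single query parameter, but a second `&` in the target URL would be read as an argument to Splash rather than to YouTube. A minimal sketch of building the same render.html URL with the target percent-encoded, using only the standard library (the helper name `render_html_url` is hypothetical, not part of Scrapy or Splash):

```python
from urllib.parse import urlencode

# Same local Splash instance as used in this issue.
SPLASH_URL = "http://localhost:8050"


def render_html_url(target_url, wait=2.0):
    """Return a render.html URL with the target URL percent-encoded."""
    # urlencode escapes ':', '/', '?', '=' and '&' inside the target URL,
    # so Splash receives it as one 'url' argument.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


print(render_html_url("https://www.youtube.com/watch?v=HOfTrhmIXIM"))
```

The printed URL can be passed to scrapy shell in place of the hand-written one.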
Scrapy shell correct result:
In [1]: response.xpath('//*[@id="container"]/h1/yt-formatted-string/text()').extract_first(default='')
Out[1]: 'Scraping, analyzing youtube channel data with python'
My spider using the scrapy-splash:
/videoSpider.py
import scrapy
from scrapy_splash import SplashRequest
from youtube_scrapy.items import YoutubeVideoItem
from scrapy.shell import inspect_response


class videoSpider(scrapy.Spider):
    name = "videoSpider"
    start_urls = ["https://www.youtube.com/watch?v=HOfTrhmIXIM"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 2.0})

    def parse(self, response):
        # inspect_response(response, self)
        # print(response.meta['splash'])
        print(response.real_url)
        item = YoutubeVideoItem()
        item['keywords'] = response.xpath('/html/head/meta[@name="keywords"]/@content').extract_first(default='')
        item['title'] = response.xpath('//*[@id="container"]/h1/yt-formatted-string/text()').extract_first(default='')
        item['visualizations'] = response.xpath('//*[@id="count"]/yt-view-count-renderer/span[1]/text()').extract_first(default='')
        item['publication_data'] = response.xpath('//*[@id="date"]/yt-formatted-string/text()').extract_first(default='')
        item['likes'] = response.xpath('//*[@id="text"]/text()')[2].extract()
        item['dislikes'] = response.xpath('//*[@id="text"]/text()')[3].extract()
        item['description'] = response.xpath('//*[@id="description"]/yt-formatted-string/text()').extract_first(default='')
        item['channel_name'] = response.xpath('//*[@id="text"]/a/text()').extract_first(default='')
        item['channel_subscribers'] = response.xpath('//*[@id="owner-sub-count"]/text()').extract_first(default='')
        yield item
/settings.py
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# url of splash server
SPLASH_URL = 'http://localhost:8050'
# and some splash variables
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Results using the spider:
{'channel_name': '',
 'channel_subscribers': '',
 'description': '',
 'keywords': 'python, data science, data, data analysis, web scraping, '
             'scraping',
 'publication_data': '',
 'title': '',
 'visualizations': ''}
I can't make sense of why it doesn't work inside the Scrapy script.
@BravoNatalie try setting the endpoint argument of your SplashRequest to execute (instead of the default render.html). That fixed me up when I was facing similar issues :)
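For anyone trying this suggestion: unlike render.html, the execute endpoint needs a Lua script passed as the `lua_source` argument. Below is a minimal sketch of what that could look like, assuming the same 2-second wait as in the original spider. The names `LUA_SOURCE`, `build_splash_args` and `make_request` are hypothetical helpers for illustration, and whether this actually returns the complete YouTube page has not been verified here.

```python
# Lua script run by Splash's execute endpoint:
# load the page, wait for scripts to render, return the resulting DOM.
LUA_SOURCE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""


def build_splash_args(wait=2.0):
    """Arguments sent to Splash; 'wait' is read in Lua as args.wait."""
    return {"lua_source": LUA_SOURCE, "wait": wait}


def make_request(url, callback, wait=2.0):
    """Build a SplashRequest for the execute endpoint (hypothetical helper)."""
    # Imported here so the helpers above can be used without scrapy-splash installed.
    from scrapy_splash import SplashRequest

    return SplashRequest(
        url=url,
        callback=callback,
        endpoint="execute",
        args=build_splash_args(wait),
    )
```

In the spider above, `start_requests` would then yield `make_request(url, self.parse)` instead of building the render.html request directly. Because the script returns a table with an `html` key, scrapy-splash should expose that HTML as the response body, so the existing XPath expressions in `parse` can stay as they are.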