How to yield a SplashRequest when using a seed file
Hello there, I have a spider which works fine for now, but since requirements are growing I looked at Frontera to handle the millions of pages I need to download. Since my URLs are pre-defined, I have put them in a seed file. The pages also need Splash rendering. I can't figure out how to handle the Splash request when using the seed file: start_requests is not called when seeds are used. Here is my code:
```python
import scrapy
from scrapy_splash import SplashRequest
from nothingbutsales.items import FootasylumItem
from scrapy.loader import ItemLoader


class FootasylumSpider(scrapy.Spider):
    name = 'footasylum'
    http_user = 'user'
    http_pass = 'userpass'

    # result = {}
    # with open('/home/shopforsales/public_html/datafeed/beforescrape/footasylum.csv', 'rb') as f:
    #     reader = csv.DictReader(f, delimiter=';')
    #     for row in reader:
    #         for column, value in row.iteritems():
    #             result.setdefault(column, []).append(value)
    # allowed_domains = ['www.footasylum.com']
    # start_urls = result['merchant_deep_link']

    custom_settings = {
        'LOG_FILE': '/tmp/' + name + '.log',
        'SEEDS_SOURCE': '/home/shopforsales/public_html/datafeed/beforescrape/seeds.txt',
    }

    def __init__(self, *args, **kwargs):
        super(FootasylumSpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse)

    def parse(self, response):
        print(response.url)
        price = response.css('li#priceFrm')
        rrp_price = price.css('span.wasprice::text').extract_first()
        print(rrp_price)
        if rrp_price is not None:
            loader = ItemLoader(item=FootasylumItem(), response=response)
            loader.add_value('rrp_price', rrp_price)
            loader.add_value('merchant_deep_link', response.url)
            loader.add_css('product_name', 'h1[itemprop="name"]')
            loader.add_css('brand', 'div[itemprop="brand"]')
            loader.add_css('breadcrumb', 'div.breadcrumb span a')
            loader.add_css('colour', 'title')
            loader.add_css('description', 'li#prod_descr')
            size_loader = loader.nested_xpath('//*[@id="uloption2"]')
            size_loader.add_css('sizes', 'span.sizevariant')
            size_loader.add_css('sizes', 'span.sizevarianttooltip')
            return loader.load_item()
```

Hope you can help me.
Frontera calls the make_requests_from_url method of the spider to create a Request from each seed provided. Have a look here: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/middlewares/seeds/__init__.py#L18.
You could do:
```python
class YourSpider(Spider):

    def make_requests_from_url(self, url):
        return SplashRequest(url)

    def parse(self, response):
        ...
```
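For completeness, the seed file is only read if Frontera's file-based seed-loader spider middleware is enabled. A minimal settings sketch (the middleware path comes from frontera's source tree; the order value 1 is an assumption):

```python
# settings.py (sketch): have Frontera read seeds from a file and call
# make_requests_from_url for each line. The middleware path is taken from
# frontera's source tree; the order value 1 is an assumption.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.seeds.file.FileSeedLoader': 1,
}
SEEDS_SOURCE = '/home/shopforsales/public_html/datafeed/beforescrape/seeds.txt'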
Thank you very much. I will try it and update you in the evening. In my scenario I need to read 150k URLs from a CSV/DB table and get them into the seed file. So I will have to read the DB and save the URLs to the seed file. Is there any way I can do this faster? Thanks.
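(For reference, a minimal sketch of that export step, reusing the CSV path and column name from the spider code above; a DB-backed version would just swap the reader for a query:)

```python
# Sketch: dump one URL per line into the seeds file that Frontera reads.
# Path and column name are taken from the spider code above.
import csv

src_path = '/home/shopforsales/public_html/datafeed/beforescrape/footasylum.csv'
dst_path = '/home/shopforsales/public_html/datafeed/beforescrape/seeds.txt'

with open(src_path) as src, open(dst_path, 'w') as dst:
    reader = csv.DictReader(src, delimiter=';')
    for row in reader:
        dst.write(row['merchant_deep_link'] + '\n')
```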
Hi @voith I tested your solution, but unfortunately it only scrapes the first URL. Any ideas please?
Hi @MuhammadRahman-awin I'm not sure that Splash works well with Frontera. I'm not a Splash user myself, so I'm not very familiar with the request/response cycle of Splash.
If you want to debug Frontera to get your Splash code working, then I can point you to a couple of places in the code for possible suspects:
- https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/requests/converters.py. Frontera converts Scrapy's builtin `Request` to its custom `FrontierRequest` and returns Scrapy's default `Response` to the spider. Frontera assumes that the user will always use the default `Request` and `Response`.
- Check whether your requests are being scheduled. Add a debugger in this method: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/schedulers/frontier.py#L91. See if all your requests arrive there (see the sketch after this list). By the way, which backend are you using? Can you see your requests added to your datastore?
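A minimal sketch of that second check, assuming enqueue_request is the method at the linked line and that a subclass can be swapped in via the SCHEDULER setting:

```python
# Sketch: log every request Scrapy hands to Frontera, so you can see
# whether your SplashRequests ever reach the frontier. Assumes
# enqueue_request is the entry point at the line linked above.
import logging

from frontera.contrib.scrapy.schedulers.frontier import FronteraScheduler

logger = logging.getLogger(__name__)


class DebugFronteraScheduler(FronteraScheduler):

    def enqueue_request(self, request):
        logger.info('enqueue_request: %s (splash meta: %s)',
                    request.url, 'splash' in request.meta)
        return super(DebugFronteraScheduler, self).enqueue_request(request)
```

Then point SCHEDULER at the subclass while debugging (e.g. `'yourproject.debug.DebugFronteraScheduler'`, a hypothetical path).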
Also have a look at https://github.com/scrapinghub/frontera/issues/232
Commenting out these lines:

`'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,`

and

`SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'`

is as good as using plain Scrapy without Frontera.
> Also what is this scheduler and why do I need to use it?
Scrapy has a builtin scheduler for feeding requests to the downloader, but it keeps its queue in memory.
Frontera is designed for broad crawling and for being polite to websites. Frontera overrides the scheduler so that it can persist requests in non-volatile storage such as a database, and then lets the user implement a custom strategy for rescheduling requests.
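Concretely, Frontera hooks into Scrapy through the settings. A minimal sketch (the spider-middleware entry and SCHEDULER value are the ones quoted above; the downloader-middleware entry is its companion from frontera's docs):

```python
# settings.py (sketch): how Frontera takes over Scrapy's scheduling.
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
```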
On Mon, Nov 6, 2017 at 3:33 AM, MuhammadRahman-awin < [email protected]> wrote:

> Further to my investigation, I found the SchedulerSpiderMiddleware does not recognise the Splash request. @voith your snippet actually works when I comment out this middleware and the Frontera scheduler:
>
> `'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,`
>
> and
>
> `SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'`
>
> Any suggestion please? Also what is this scheduler and why do I need to use it? Thanks a lot
I think the first step for @m-usman-dar is to make Scrapy work with Splash via SplashMiddleware, without Frontera.
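(For reference, the standard scrapy-splash wiring from its README; SPLASH_URL must point at your own Splash instance:)

```python
# settings.py (sketch): standard scrapy-splash setup, per its README.
SPLASH_URL = 'http://localhost:8050'  # adjust to your Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```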
With your support and patience I have managed to get the single-threaded scraper working with Frontera. However, I had to override the scheduler to add a splash_url check.
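(For illustration only, a minimal sketch of what such an override could look like; this is not the actual patch, and it assumes scrapy-splash keeps its state under request.meta['splash'] and that FronteraScheduler has a local pending queue fed by _add_pending_request:)

```python
# Sketch, NOT the actual override: short-circuit Splash requests past the
# frontier so their meta (and the splash_url) survive untouched.
# Assumes scrapy-splash stores its state in request.meta['splash'] and that
# FronteraScheduler exposes _add_pending_request for its local queue.
from frontera.contrib.scrapy.schedulers.frontier import FronteraScheduler


class SplashAwareFronteraScheduler(FronteraScheduler):

    def enqueue_request(self, request):
        if 'splash' in request.meta:
            # Keep Splash requests in the scheduler's local pending queue
            # instead of round-tripping them through the frontier converter.
            self._add_pending_request(request)
            return True
        return super(SplashAwareFronteraScheduler, self).enqueue_request(request)
```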
Submit a PR; we can discuss there if you think there's something wrong with Frontera.