
How to yield splash request when using seed file

Open MuhammadRahman-awin opened this issue 8 years ago • 9 comments

Hello there, I have a spider which works fine for now, but since requirements are growing I looked at frontera to handle millions of pages to download. Since my URLs are pre-defined, I have put them in a seeds file. The pages also need splash rendering. I can't figure out how to handle the splash request when using the seeds file: `start_requests` is not getting called when seeds are used. Here is my code:

```python
import scrapy
from scrapy.loader import ItemLoader
from scrapy_splash import SplashRequest

from nothingbutsales.items import FootasylumItem


class FootasylumSpider(scrapy.Spider):
    name = 'footasylum'
    http_user = 'user'
    http_pass = 'userpass'

    # result = {}
    # with open('/home/shopforsales/public_html/datafeed/beforescrape/footasylum.csv', newline='') as f:
    #     reader = csv.DictReader(f, delimiter=';')
    #     for row in reader:
    #         for column, value in row.items():
    #             result.setdefault(column, []).append(value)

    # allowed_domains = ['www.footasylum.com']
    # start_urls = result['merchant_deep_link']

    custom_settings = {
        'LOG_FILE': '/tmp/' + name + '.log',
        'SEEDS_SOURCE': '/home/shopforsales/public_html/datafeed/beforescrape/seeds.txt',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse)

    def parse(self, response):
        print(response.url)
        price = response.css('li#priceFrm')
        rrp_price = price.css('span.wasprice::text').extract_first()
        print(rrp_price)

        if rrp_price is not None:
            loader = ItemLoader(item=FootasylumItem(), response=response)
            loader.add_value('rrp_price', rrp_price)
            loader.add_value('merchant_deep_link', response.url)
            loader.add_css('product_name', 'h1[itemprop="name"]')
            loader.add_css('brand', 'div[itemprop="brand"]')
            loader.add_css('breadcrumb', 'div.breadcrumb span a')
            loader.add_css('colour', 'title')
            loader.add_css('description', 'li#prod_descr')
            size_loader = loader.nested_xpath('//*[@id="uloption2"]')
            size_loader.add_css('sizes', 'span.sizevariant')
            size_loader.add_css('sizes', 'span.sizevarianttooltip')

            return loader.load_item()
```

Hope you can help me.

MuhammadRahman-awin avatar Oct 31 '17 02:10 MuhammadRahman-awin

Frontera calls the `make_requests_from_url` method of the spider to create a `Request` from each of the seeds provided. Have a look here: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/middlewares/seeds/__init__.py#L18. You could do:

```python
class YourSpider(Spider):
    def make_requests_from_url(self, url):
        return SplashRequest(url)

    def parse(self, response):
        ...
```

voith avatar Oct 31 '17 06:10 voith

Thank you very much. I will try it and update you in the evening. In my scenario, I need to read 150k URLs from a CSV/DB table and have them in the seeds file, so I will have to read the DB and save the URLs to the seeds file. Is there any way I can do this faster? Thanks.
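For later readers: one straightforward option is to generate the seeds file directly from the CSV. This is a minimal sketch; the `;` delimiter and the `merchant_deep_link` column are taken from the commented-out code in the spider above, and the function name is my own invention:

```python
import csv

def csv_to_seeds(csv_path, seeds_path, url_column='merchant_deep_link'):
    """Write a frontera seeds file (one URL per line) from a CSV feed."""
    with open(csv_path, newline='') as src, open(seeds_path, 'w') as dst:
        for row in csv.DictReader(src, delimiter=';'):
            url = (row.get(url_column) or '').strip()
            if url:  # skip rows with an empty URL cell
                dst.write(url + '\n')
```

Then point `SEEDS_SOURCE` at the generated file. For 150k rows this runs in a few seconds, so regenerating before each crawl is usually fine.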

MuhammadRahman-awin avatar Oct 31 '17 15:10 MuhammadRahman-awin

Hi @voith, I tested your solution, but unfortunately it only scrapes the first URL. Any idea, please?

MuhammadRahman-awin avatar Oct 31 '17 22:10 MuhammadRahman-awin

Hi @MuhammadRahman-awin, I'm not sure how well splash works with frontera. I'm not a splash user myself, so I'm not very familiar with the request/response cycle of splash.

If you want to debug frontera to get your splash code working, then I could point you to a couple of places in the code as possible suspects:

  1. https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/requests/converters.py. Frontera converts scrapy's built-in `Request` to its custom `FrontierRequest` and returns scrapy's default `Response` to the spider. Frontera assumes that the user will always use the default `Request` and `Response`.

  2. Check if your requests are being scheduled. Add a debugger in this method: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/scrapy/schedulers/frontier.py#L91. See if all your requests arrive there. Btw, which backend are you using? Can you see your requests added to your datastore?
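A low-effort complement to point 2 is a tiny spider middleware that logs the type of everything the spider yields, so you can see whether `SplashRequest` objects survive the round trip through frontera. The class name and its wiring are illustrative, not part of frontera or scrapy-splash:

```python
import logging

logger = logging.getLogger(__name__)

class RequestTypeLoggerMiddleware:
    """Debugging aid: log the class name of every object the spider yields.

    Enable it in SPIDER_MIDDLEWARES with a high order value so it runs
    right after the spider, before frontera's SchedulerSpiderMiddleware.
    """
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # e.g. "SplashRequest" vs. a plain "Request" or an item class
            logger.debug('spider yielded a %s', type(obj).__name__)
            yield obj
```

If the log shows `SplashRequest` here but plain `Request` objects reach the downloader, the conversion step in point 1 is the likely culprit.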

voith avatar Nov 01 '17 06:11 voith

Also have a look at https://github.com/scrapinghub/frontera/issues/232

voith avatar Nov 01 '17 07:11 voith

Commenting out the lines

    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,

and

    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

is as good as using plain Scrapy without frontera.

As for what this scheduler is and why you need it: Scrapy has a built-in scheduler for scheduling your requests to the downloader, but it keeps them in memory.

Frontera is designed for broad crawling and for being polite to websites. Frontera overrides the scheduler so that it can persist requests in non-volatile storage like a db and then allow the user to implement a custom strategy for rescheduling requests.
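For reference, the settings combination this thread revolves around looks roughly like this. The frontera lines are the ones quoted in the thread; the scrapy-splash lines follow that library's documented setup. Treat the order values and the `SPLASH_URL` as assumptions to verify against your installed versions:

```python
# settings.py (sketch)

# frontera takes over scheduling so requests can be persisted:
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}

SPLASH_URL = 'http://localhost:8050'  # wherever your splash instance runs
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```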

On Mon, Nov 6, 2017 at 3:33 AM, MuhammadRahman-awin wrote:

Further to my investigation, I found that SchedulerSpiderMiddleware does not recognise the splash request. @voith your snippet actually works when I comment out this middleware and the Frontera scheduler:

    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,

and

    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

Any suggestion please? Also, what is this scheduler and why do I need to use it? Thanks a lot


voith avatar Nov 06 '17 05:11 voith

I think the first step for @m-usman-dar is to make Scrapy with Splash (using SplashMiddleware) work without Frontera.

sibiryakov avatar Nov 06 '17 10:11 sibiryakov

With your support and patience I have managed to get the single-threaded scraper working with frontera. However, I had to override the scheduler to add a splash_url check.
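The actual override was never posted, but for later readers the kind of change described might look like this mixin, which re-attaches splash metadata to requests the scheduler hands back out. The class name, the hook point, and the splash args are all assumptions:

```python
class SplashMetaRestorer:
    """Hypothetical sketch: restore splash meta on requests returned by
    the scheduler, since frontera's converter may drop it.

    Intended usage (FronteraScheduler not imported here):
        class SplashAwareScheduler(SplashMetaRestorer, FronteraScheduler):
            pass
    """
    splash_args = {'wait': 0.5}  # placeholder; tune per site

    def next_request(self):
        request = super().next_request()
        if request is not None and 'splash' not in request.meta:
            # re-enable splash rendering for rehydrated requests
            request.meta['splash'] = {'args': dict(self.splash_args)}
        return request
```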

MuhammadRahman-awin avatar Nov 06 '17 23:11 MuhammadRahman-awin

Submit a PR. We can discuss there if you think there's something wrong with frontera.

voith avatar Nov 07 '17 06:11 voith