
ScrapyDo
========

Crochet_-based blocking API for Scrapy_.

This module provides helper functions to run Scrapy_ in a blocking fashion. See the `scrapydo-overview.ipynb <http://nbviewer.ipython.org/github/darkrho/scrapydo/blob/master/notebooks/scrapydo-overview.ipynb>`_ notebook for a quick overview of this module.

Installation
------------

Using pip::

    pip install scrapydo

Usage
-----

The function ``scrapydo.setup`` must be called once to initialize the reactor.

Example:

.. code:: python

    import scrapydo
    from scrapy import Request

    scrapydo.setup()

    scrapydo.default_settings.update({
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_PAGECOUNT': 10,
    })

    # Enable logging display
    import logging
    logging.basicConfig(level=logging.DEBUG)

    # Fetch a single URL.
    response = scrapydo.fetch("http://example.com")

    # Crawl a URL with a given callback.
    def parse_page(response):
        yield {
            'title': response.css('title').extract(),
            'url': response.url,
        }
        for href in response.css('a::attr(href)'):
            url = response.urljoin(href.extract())
            yield Request(url, callback=parse_page)

    items = scrapydo.crawl('http://example.com', parse_page)

    # Run an existing spider class.
    spider_args = {'foo': 'bar'}
    items = scrapydo.run_spider(MySpider, **spider_args)

Available Functions
-------------------

``scrapydo.setup()``
    Initialize the reactor.

``scrapydo.fetch(url, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Fetches a URL and returns the response.

``scrapydo.crawl(url, callback, spider_cls=DefaultSpider, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT)``
    Crawls a URL with the given callback and returns the scraped items.

``scrapydo.run_spider(spider_cls, capture_items=True, return_crawler=False, settings=None, timeout=DEFAULT_TIMEOUT, **kwargs)``
    Runs a spider and returns the scraped items.

``highlight(code, lexer='html', formatter='html', output_wrapper=None)``
    Highlights the given code using pygments. This function is suitable for use in an IPython notebook.
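Since ``highlight`` uses pygments under the hood, the kind of HTML it emits can be previewed with pygments directly. The following is a minimal sketch, not scrapydo's own implementation; it only assumes pygments is installed:

.. code:: python

    # Render an HTML snippet as highlighted HTML, as scrapydo's
    # highlight helper does via pygments.
    from pygments import highlight
    from pygments.lexers import HtmlLexer
    from pygments.formatters import HtmlFormatter

    snippet = "<title>Example</title>"
    html = highlight(snippet, HtmlLexer(), HtmlFormatter())
    print(html)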

.. _Scrapy: http://scrapy.org
.. _Crochet: https://github.com/itamarst/crochet