Crawler
=======

Web Scraping Framework
.. image:: https://travis-ci.org/lorien/crawler.png?branch=master
   :target: https://travis-ci.org/lorien/crawler

.. image:: https://coveralls.io/repos/lorien/crawler/badge.svg?branch=master
   :target: https://coveralls.io/r/lorien/crawler?branch=master
This project is about:
- 100% test coverage
- Speed
- Reasonable memory consumption
- SOCKS proxy support
- Runtime metrics API
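How SOCKS proxies are configured is not shown in this README. The sketch below is only a guess at what per-request proxy configuration could look like: the ``proxy`` keyword and the URL scheme are assumptions, not documented parameters, so check the crawler source for the real option.

.. code:: python

    from crawler import Crawler, Request


    class ProxyCrawler(Crawler):
        def task_generator(self):
            # Hypothetical: the ``proxy`` keyword is an assumption,
            # not confirmed by this README.
            yield Request('page', 'https://github.com/',
                          proxy='socks5://127.0.0.1:9050')

        def handler_page(self, req, res):
            print('Fetched %s via SOCKS proxy' % req.url)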
Usage Example
=============
.. code:: python

    from urllib.parse import urlsplit

    from crawler import Crawler, Request


    class TestCrawler(Crawler):
        def task_generator(self):
            # Seed the crawl with one request per host.
            for host in ('yandex.ru', 'github.com'):
                yield Request('page', 'https://%s/' % host, meta={'host': host})

        def handler_page(self, req, res):
            title = res.xpath('//title').text(default='N/A')
            print('Title of [%s]: %s' % (req.url, title))
            # Collect links whose host differs from the page's own host.
            ext_urls = set()
            for elem in res.xpath('//a[@href]'):
                url = elem.attr('href')
                parts = urlsplit(url)
                if parts.netloc and req.meta['host'] not in parts.netloc:
                    ext_urls.add(url)
            print('External URLs:')
            for url in ext_urls:
                print(' * %s' % url)


    bot = TestCrawler(num_network_threads=10)
    bot.run()
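Note how routing works in the example: the first argument to ``Request`` is a tag, and a response tagged ``page`` is dispatched to the crawler method named ``handler_page``.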
Quick Way to Start New Project
==============================
- Install crawler with::

      pip install crawler

- Run the command::

      crawl_start_project <project_name>

- ``cd`` into the new directory
- Run::

      make build

This builds a virtualenv with everything you need to start using crawler.
To activate the virtualenv, run the ``pipenv shell`` command.

Put your crawler code into the ``crawlers/`` directory, then run it with the
``crawl <CrawlerClassName>`` command. A minimal crawler module is sketched below.
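The sketch below is illustrative only: the module name, class name, and URL are assumptions, and it reuses nothing beyond the ``Crawler``/``Request`` API shown in the usage example above.

.. code:: python

    # crawlers/example.py -- illustrative module; file name, class name,
    # and URL are assumptions, not part of this README.
    from crawler import Crawler, Request


    class ExampleCrawler(Crawler):
        def task_generator(self):
            # The 'page' tag routes the response to handler_page.
            yield Request('page', 'https://github.com/')

        def handler_page(self, req, res):
            print('Fetched %s' % req.url)

Such a crawler would then be started with ``crawl ExampleCrawler``.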
Installation
============
.. code:: bash

    pip install crawler