
CrawlerFlow

A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.

Features

  • [*] Write spiders in YAML configs.
  • [*] Create extractors to scrape data using YAML configs (HTML, API, RSS).
  • [*] Define multiple extractors per spider.
  • [*] Use standard extractors to scrape data such as tables, paragraphs, meta tags, and JSON-LD from a page.
  • [ ] Traverse between multiple websites.
  • [ ] Write Python extractors for advanced extraction strategies.
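
To make the idea of a "standard extractor" concrete, here is a minimal sketch of pulling meta tags and JSON-LD out of a page using only the Python standard library. It is not CrawlerFlow's implementation: the class MetaAndJsonLdParser and the function extract_meta_and_json_ld are hypothetical names chosen for this illustration, and the output shape is an assumption.

```python
import json
from html.parser import HTMLParser


class MetaAndJsonLdParser(HTMLParser):
    """Hypothetical helper: collects <meta> pairs and JSON-LD script blocks."""

    def __init__(self):
        super().__init__()
        self.meta = {}         # meta name/property -> content
        self.json_ld = []      # one parsed object per JSON-LD script tag
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_meta_and_json_ld(html):
    """Return the meta tags and JSON-LD objects found in an HTML string."""
    parser = MetaAndJsonLdParser()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "json_ld": parser.json_ld}
```

Calling extract_meta_and_json_ld(page_html) returns a dictionary with a "meta" mapping and a "json_ld" list, which is roughly the kind of record a dataset row could be built from.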

Installation

pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow

Usage

Scraping with CrawlerFlow

from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json


def load_config(path):
    # Parse a YAML config file and close the file handle afterwards.
    with open(path) as config_file:
        return yaml_to_json(config_file)


# Requests to crawl, spider settings, and the extractor applied by default.
crawl_requests = load_config("example-configs/crawlerflow/requests/github-detail-urls.yml")
spider_config = load_config("example-configs/crawlerflow/spiders/default-spider.yml")
github_default_extractor = load_config("example-configs/crawlerflow/extractors/github-blog-detail.yml")

flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()

Scraping with WebCrawler

from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

# One YAML config file per spider to register.
scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

crawler = WebCrawler()

for scraper_config_file in scraper_config_files:
    # Parse the config and close the file handle before registering the spider.
    with open(scraper_config_file) as config_file:
        scraper_config = yaml_to_json(config_file)
    crawler.add_spider_with_config(scraper_config)

# Start crawling once every spider has been registered.
crawler.start()

Refer to the example-configs/ folder for example configs.
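
When the number of config files grows, listing each path by hand becomes tedious. The sketch below shows one possible way to collect them from a folder with the standard library. The helper find_config_files is a hypothetical name and is not part of CrawlerFlow; it also assumes every matching file under the folder is a spider config.

```python
from pathlib import Path


def find_config_files(root, pattern="*.yml"):
    """Return every file under `root` matching `pattern`, sorted for a stable order."""
    # rglob searches the folder recursively, so nested folders are included.
    return sorted(str(path) for path in Path(root).rglob(pattern) if path.is_file())
```

For example, find_config_files("example-configs/webcrawler") could stand in for the hand-written scraper_config_files list, provided the folder contains only configs meant to run together.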

Available Extractors

  • [*] HTMLExtractor
  • [*] MetaTagExtractor
  • [*] JSONLDExtractor
  • [*] TableContentExtractor
  • [*] IconsExtractor
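
As a rough illustration of what table content extraction involves, the sketch below turns each HTML table into a list of rows using only the standard library. It is not the code behind TableContentExtractor: TableParser and extract_tables are hypothetical names, and the sketch assumes tables are not nested.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Hypothetical helper: collects each <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text chunks of the cell currently being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._close_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_tables(html):
    """Return all tables in an HTML string as nested lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Each table comes back as a list of rows, and each row as a list of trimmed cell strings, with header cells treated the same way as data cells.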