CrawlerFlow

A web crawler orchestration framework that lets you create datasets from multiple web sources using YAML configurations.

Features

[*] Write spiders in YAML configs.
[*] Create extractors to scrape data using YAML configs (HTML, API, RSS).
[*] Define multiple extractors per spider.
[*] Use standard extractors to scrape common page data such as tables, paragraphs, meta tags, and JSON-LD.
[ ] Traverse between multiple websites.
[ ] Write Python extractors for advanced extraction strategies.

Installation

pip install git+https://github.com/invana/crawlerflow#egg=crawlerflow
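
After installing, it can be handy to confirm that the package is importable from the active environment. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not something CrawlerFlow provides.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


# Prints True once the pip command above has succeeded in this environment.
print(is_installed("crawlerflow"))
```

The check only looks the package up and does not import it, so it has no side effects.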

Usage

Scraping with CrawlerFlow

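from crawlerflow.runner import Crawlerflow
from crawlerflow.utils import yaml_to_json

# Load the crawl requests, the spider config, and the default extractor config.
# The files are opened in "with" blocks so the handles are closed after parsing.
with open("example-configs/crawlerflow/requests/github-detail-urls.yml") as f:
    crawl_requests = yaml_to_json(f)
with open("example-configs/crawlerflow/spiders/default-spider.yml") as f:
    spider_config = yaml_to_json(f)
with open("example-configs/crawlerflow/extractors/github-blog-detail.yml") as f:
    github_default_extractor = yaml_to_json(f)

# Register the spider with its requests and default extractor, then run the crawl.
flow = Crawlerflow()
flow.add_spider_with_config(crawl_requests, spider_config, default_extractor=github_default_extractor)
flow.start()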

Scraping with WebCrawler

from crawlerflow.runner import WebCrawler
from crawlerflow.utils import yaml_to_json

# One scraper config file per spider.
scraper_config_files = [
    "example-configs/webcrawler/APISpiders/api-publicapis-org.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-list.yml",
    "example-configs/webcrawler/HTMLSpiders/github-blog-detail.yml"
]

crawler = WebCrawler()

# Register a spider for each config file, then start them all.
for scraper_config_file in scraper_config_files:
    with open(scraper_config_file) as config_file:
        scraper_config = yaml_to_json(config_file)
    crawler.add_spider_with_config(scraper_config)
crawler.start()
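
Instead of listing every config file by hand, the paths could also be collected from a directory. The following is a minimal sketch using only the standard library; `find_config_files` is a hypothetical helper, not part of CrawlerFlow, and it assumes configs use the `.yml` extension as in the examples above.

```python
from __future__ import annotations

from pathlib import Path


def find_config_files(root: str) -> list[str]:
    """Return the paths of all .yml files below `root`, sorted for a stable order."""
    return sorted(str(path) for path in Path(root).rglob("*.yml") if path.is_file())
```

For example, `find_config_files("example-configs/webcrawler")` would return a list that can be used in place of `scraper_config_files` in the loop above.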

Refer to the example-configs/ folder for example configs.

Available Extractors

[*] HTMLExtractor
[*] MetaTagExtractor
[*] JSONLDExtractor
[*] TableContentExtractor
[*] IconsExtractor
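
To give a rough idea of what JSON-LD extraction involves, here is a minimal sketch built on Python's standard library. It is not CrawlerFlow's `JSONLDExtractor` implementation; the names `JSONLDParser` and `extract_jsonld` are hypothetical and exist only for illustration.

```python
import json
from html.parser import HTMLParser


class JSONLDParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only when a JSON-LD script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD blocks.


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object found in the given HTML string."""
    parser = JSONLDParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Calling `extract_jsonld(page_html)` returns a list with one parsed object per JSON-LD script tag, and an empty list when the page has none.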
