wordpress-scraper

Description

A simple, easy-to-use scraper for pulling data from the WordPress JSON API.

Features

Stores crawled documents as MongoDB documents or JSON files
Automatically retries on errors
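
The retry behaviour can be pictured with a small sketch. This is not the scraper's actual implementation: the function name `retry`, its parameters, and the exponential backoff schedule are all assumptions chosen for illustration.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between tries. The sleep function is injectable so it can be stubbed.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Wrapping a single HTTP request in a helper like this keeps a transient network error from aborting a long crawl, while a persistent failure still surfaces once the attempts are used up.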

Requirements

Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL:

python3 crawl.py https://your.website.here

This crawls the site using DefaultCrawlSession, which attempts to crawl all posts, categories and tags from the site.

The crawled JSON files will be stored in the directory ./data/<domain-name>.
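
To work with the output afterwards, something like the following sketch may help. It assumes that `<domain-name>` is the host part of the URL passed to crawl.py and that the files carry a `.json` extension; the helper names `output_dir` and `load_documents` are hypothetical, and the exact file names inside the directory are not specified here.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def output_dir(site_url, root="./data"):
    """Return root/<domain-name> for a site URL (assumed layout)."""
    domain = urlparse(site_url).netloc
    return Path(root) / domain


def load_documents(directory):
    """Yield (file name, parsed JSON) for every .json file in directory."""
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.name, json.load(fh)
```

For example, `load_documents(output_dir("https://your.website.here"))` would iterate over whatever JSON files a crawl of that site left behind.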

In most cases this is enough for sites that:

do not require signing in
do not block the JSON API paths
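
Whether a site meets both conditions can be checked before starting a crawl. The sketch below is independent of the scraper's own code: it builds the standard WordPress REST API v2 paths (`/wp-json/wp/v2/<resource>`) and probes them with the standard library. The helper names, the User-Agent string, and the choice to treat anything other than HTTP 200 as unavailable are assumptions.

```python
import urllib.error
import urllib.request

# The three resources the default session is described as crawling.
RESOURCES = ("posts", "categories", "tags")


def endpoint(site_url, resource):
    """Build the standard WP REST API v2 URL for one resource."""
    return f"{site_url.rstrip('/')}/wp-json/wp/v2/{resource}"


def is_reachable(url, timeout=10):
    """Return True if the URL answers with HTTP 200."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "wordpress-scraper-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 401/403/404, DNS failures and timeouts.
        return False
```

Calling `is_reachable(endpoint(site, name))` for each name in `RESOURCES` gives a quick indication of whether the basic usage is likely to work for that site.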

Advanced

For advanced usage and customization, look at wpscraper/session.py for the actual crawling procedures, then write your own CrawlSession.
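
Before writing a custom session, it can help to see what crawling one resource boils down to. The sketch below does not use the classes in wpscraper/session.py, whose interface is not described here. It is a standalone illustration of paging through a WordPress REST API collection with the API's `page` and `per_page` query parameters (`per_page` is capped at 100 by WordPress). `crawl_resource` and `fetch_page` are hypothetical names; `fetch_page` is expected to return an empty list past the last page, so a real implementation would have to translate the error WordPress returns for an out-of-range page.

```python
def crawl_resource(fetch_page, per_page=100):
    """Collect every item of one resource by walking its pages.

    fetch_page(page, per_page) is a hypothetical callable that returns the
    list of items on that page, or an empty list past the last page.
    """
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        # A short or empty page means there is nothing left to fetch.
        if len(batch) < per_page:
            break
        page += 1
    return items
```

Keeping the fetch step as an injected callable makes it straightforward to swap in authentication, different retry rules, or a fake data source for testing.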

Upcoming Features

[x] Rewrite/Refactor
[x] MongoDB Connector
[ ] Async session
[ ] Authentication Module
[ ] Cloudflare circumvention
[ ] Configurable retry policies
[ ] Full WPv2 API resources support
