bbscraper

Simple phpBB forum thread web scraper written in Python. Designed for command-line usage. Outputs data in CSV format to stdout.

This is an experiment-driven project. The code mostly follows PEP 8 but is not fully idiomatic. The current implementation is ad hoc, written for one particular scenario; however, extending it to cover additional behavior and features should be trivial.

The scraped data fields per thread post are, in order: Post ID, Post name, Date of the post, and Post body.
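As a rough illustration of the output shape, the field order above could be written to stdout with Python's standard csv module. This is a minimal sketch, not the project's actual code: the dictionary keys, the `write_posts` helper, and the sample post are hypothetical, and whether bbscraper emits a header row is not stated here, so none is written.

```python
import csv
import sys

# Field order as listed above. The key names are hypothetical,
# chosen only for this sketch.
FIELDS = ("post_id", "post_name", "date", "body")


def write_posts(posts, out=sys.stdout):
    """Write one CSV row per post, keeping the documented field order."""
    writer = csv.writer(out)
    for post in posts:
        writer.writerow([post[name] for name in FIELDS])


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {
            "post_id": "1",
            "post_name": "alice",
            "date": "2015-01-01",
            "body": "Hello, world",
        },
    ]
    write_posts(sample)
```

The csv module quotes any field containing a comma, quote, or line break, which matters for post bodies, since they are free text.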

Uses urllib3 for HTTP networking and BeautifulSoup for HTML parsing.

This package is not available via pip. You must download or clone this repository in order to use it.

Requirements

Installation

Clone this repository:

git clone https://github.com/h2non/bbscraper.git && cd bbscraper

Install dependencies via pip:

sudo pip install -r requirements.txt

Or alternatively using setup.py:

python setup.py install

Command-line interface

usage: __main__.py [-h] -u URL [-f FORMAT] [-l LIMIT]

Scrape all thread posts of a phpBB based forum

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     Full URL to forum thread
  -f FORMAT, --format FORMAT
                        Output format (default to CSV)

Report any issues to https://github.com/h2non/bbscraper/issues
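For readers who want to extend the interface, the help output above could be produced by an argparse setup along these lines. This is a sketch under assumptions rather than the project's actual source: the `build_parser` helper, the `--limit` long name, its integer type, and the lowercase `csv` default are guesses, and the meaning of LIMIT is not documented in the help text.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage line shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="bbscraper",
        description="Scrape all thread posts of a phpBB based forum",
        epilog="Report any issues to https://github.com/h2non/bbscraper/issues",
    )
    # -u is mandatory in the usage line, so it is marked as required.
    parser.add_argument("-u", "--url", required=True,
                        help="Full URL to forum thread")
    # Assumption: the default value is the lowercase string "csv".
    parser.add_argument("-f", "--format", default="csv",
                        help="Output format (default to CSV)")
    # Assumption: the long name and int type are guesses; the usage line
    # only shows "-l LIMIT" without describing it.
    parser.add_argument("-l", "--limit", type=int, default=None,
                        help="Limit (semantics not documented above)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.format, args.limit)
```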

Scrape a forum thread and save the data to forum.csv (the URL is quoted so the shell does not treat `?` as a wildcard):

python bbscraper -u "http://www.oldclassiccar.co.uk/forum/phpbb/phpBB2/viewtopic.php?t=12591" > forum.csv

Development

Run tests:

make test

License

MIT - Tomas Aparicio