bbscraper
Simple phpBB forum thread web scraper written in Python
Designed for command-line usage. Outputs data in CSV format to stdout.
This is an experiment-driven project. The code mostly follows PEP 8 but is not fully idiomatic. The current implementation is ad hoc and targets one concrete scenario; extending it to cover additional behavior and features should be straightforward.
The scraped data fields per thread post are, in order: post ID, post name, date of the post, and post body.
Uses urllib3 for HTTP networking and BeautifulSoup for HTML parsing.
This package is not available via pip.
You must download or clone this repository in order to use it.
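As an illustration of the output described above, here is a minimal sketch of writing one CSV record per post to stdout in the documented field order. The `posts` sample data, the date format, and the `write_posts` helper are hypothetical and not taken from the project's code; the actual quoting and date format may differ.

```python
import csv
import io
import sys

# Hypothetical sample rows. Field order follows the README:
# post ID, post name, date of the post, post body.
posts = [
    ("101", "First post", "2015-03-01 10:15", "Hello, forum!"),
    ("102", "Re: First post", "2015-03-01 11:02",
     'A body with "quotes",\ncommas and a newline.'),
]


def write_posts(rows, stream):
    """Write one CSV record per post to the given text stream."""
    # QUOTE_MINIMAL quotes only fields containing commas, quotes or newlines,
    # which post bodies frequently do.
    writer = csv.writer(stream, quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)


if __name__ == "__main__":
    write_posts(posts, sys.stdout)
```

Letting the `csv` module handle quoting matters here because post bodies usually contain commas and line breaks, which would corrupt hand-joined rows.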
Requirements
- python +3 (developed using [email protected])
- pip (optional)
Installation
Clone this repository:
git clone https://github.com/h2non/bbscraper.git && cd bbscraper
Install dependencies via pip:
sudo pip install -r requirements.txt
Or alternatively using setup.py:
python setup.py install
Command-line interface
usage: __main__.py [-h] -u URL [-f FORMAT] [-l LIMIT]

Scrape all thread posts of a phpBB based forum

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     Full URL to forum thread
  -f FORMAT, --format FORMAT
                        Output format (default to CSV)

Report any issues to https://github.com/h2non/bbscraper/issues
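For readers extending the interface, the usage text above could be produced by an `argparse` parser along these lines. This is a sketch, not the project's actual code: the `build_parser` name, the `"csv"` default value, and everything about `-l` beyond its presence in the usage line (long name, integer type, meaning) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented usage text (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Scrape all thread posts of a phpBB based forum",
        epilog="Report any issues to https://github.com/h2non/bbscraper/issues",
    )
    # -u is shown without brackets in the usage line, so it is required.
    parser.add_argument("-u", "--url", required=True,
                        help="Full URL to forum thread")
    # Assumption: the default is spelled "csv"; the help text only says CSV.
    parser.add_argument("-f", "--format", default="csv",
                        help="Output format (default to CSV)")
    # Assumption: -l appears in the usage line without a description;
    # the long option name, type and meaning below are guesses.
    parser.add_argument("-l", "--limit", type=int, default=None,
                        help="Maximum number of posts to scrape (assumed)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-u", "http://example.com/viewtopic.php?t=1", "-l", "10"]
)
print(args.url, args.format, args.limit)
```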
Scrape a thread and save the data in forum.csv (the URL is quoted so the shell does not interpret the `?`):
python bbscraper -u "http://www.oldclassiccar.co.uk/forum/phpbb/phpBB2/viewtopic.php?t=12591" > forum.csv
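Once forum.csv exists, it can be loaded back with the standard `csv` module. The sketch below assumes the file has no header row, exactly four columns in the documented order, and UTF-8 encoding; none of these is confirmed by the project. `Post` and `read_posts` are hypothetical names.

```python
import csv
from collections import namedtuple

# Field order follows the README: post ID, post name, date, post body.
Post = namedtuple("Post", ["post_id", "name", "date", "body"])


def read_posts(path):
    """Return the scraped posts from a CSV file as a list of Post tuples."""
    # newline="" lets the csv module handle line breaks inside quoted bodies.
    with open(path, newline="", encoding="utf-8") as handle:
        # Skip blank records; a row without four fields raises TypeError.
        return [Post(*row) for row in csv.reader(handle) if row]
```

Calling `read_posts("forum.csv")` then gives a list whose items expose `post_id`, `name`, `date` and `body` attributes.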
Development
Run tests:
make test
License
MIT - Tomas Aparicio