scrape-the-gibson
Code snippets for a workshop on web scraping.
Scrape the Gibson
These code snippets are the core of a post I wrote about web scraping in Python. The post is aimed at people who have already done a bit of coding but want to explore scraping in Python
in more depth. The workshop will be much easier if you have a Mac or Linux-based computer.
Dependencies
- Download repo: https://github.com/abelsonlive/scrape-the-gibson
- Install dependencies
- If you don't have pip installed, type:
sudo easy_install pip
- Change into the downloaded repo's directory:
cd scrape-the-gibson
- Now run:
sudo pip install -r requirements.txt
Topics
Introduction
- Getting started with Scraping in Python using requests
- Exploring HTML documents and extracting the data, with BeautifulSoup
- Saving scraped data to a database with dataset
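The three introductory steps can be sketched together as below. This is a minimal illustration, not the workshop's own code: the function names, the `links` table, the `sqlite:///links.db` database URL, and the `https://example.com` address are all assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Return one dict per <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def save_links(rows, db_url="sqlite:///links.db"):
    """Insert each row into a 'links' table (hypothetical table name)."""
    # Imported here so the fetching and parsing code works without dataset.
    import dataset

    db = dataset.connect(db_url)
    table = db["links"]  # dataset creates the table and columns as needed
    for row in rows:
        table.insert(row)


# Example run (placeholder URL, needs network access):
#   html = fetch("https://example.com")
#   save_links(extract_links(html))
```

Keeping fetching, parsing, and saving in separate functions means the parsing step can be tried on a saved HTML string without touching the network or the database.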
Advanced
- Thinking about ETL (Extract, Transform, Load)
  - Keep your source data around.
- Running multiple requests in parallel to scrape faster
  - Thready
- Regular Expressions to Extract More Data
- Programmatic crawling of entire sites.
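The advanced topics above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the workshop's code: every function name here is hypothetical, `concurrent.futures` stands in for Thready, and hrefs are pulled out with a regular expression only to keep the example dependency-free (an HTML parser is more robust on real pages). The `fetch` argument is any function that takes a URL and returns HTML.

```python
import hashlib
import re
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def extract_emails(html):
    """Transform: pull e-mail addresses out of raw HTML with a regex."""
    return sorted(set(EMAIL_RE.findall(html)))


def same_site_links(base_url, html):
    """Resolve hrefs against base_url and keep those on the same host."""
    host = urlparse(base_url).netloc
    links = set()
    for href in HREF_RE.findall(html):
        absolute = urldefrag(urljoin(base_url, href)).url  # drop #fragments
        if urlparse(absolute).netloc == host:
            links.add(absolute)
    return sorted(links)


def save_raw(pages, cache_dir):
    """Extract: keep the source HTML on disk so parsing can be redone later."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    for url, html in pages.items():
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        (cache / name).write_text(html, encoding="utf-8")


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for each URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site; returns {url: html}."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in same_site_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because the raw pages are returned and saved before any data is extracted, the regular expressions can be changed and rerun against the cached files without requesting the site again.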
Links
There are plenty of existing resources on scraping. A few links:
- Paul Bradshaw's Scraping for Journalists, excellent for non-coders.
- School of Data Handbook Recipes
- ScraperWiki (Classic) Docs, moving to GitHub