prefect-webscraper-example
Quick and dirty example of using Prefect Core to scrape a website
This repository is a complete tutorial on using Prefect to scrape a website and deploy the flow to Prefect Cloud for scheduled orchestration.
It mostly follows the tutorial documented here, rewritten to run on Prefect Cloud on a schedule:
- https://docs.prefect.io/core/advanced_tutorials/advanced-mapping.html#outline
Following that example, the flow writes data to a local SQLite table. That doesn't make much sense when the containers are ephemeral, but it illustrates the pipeline execution. In practice, when orchestrating through Prefect Cloud, you'd want to persist the data to a database or repository on a separate dedicated system.
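The persistence step amounts to little more than the following sketch (standard-library sqlite3 only; the table and column names here are illustrative, not the ones used in the example flows):

```python
import sqlite3

# Stand-in for rows produced by the scraping tasks (url, title pairs).
rows = [("https://example.com/a", "Page A"),
        ("https://example.com/b", "Page B")]

# The example flows write to a local file; :memory: keeps this sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # → 2
```

Because the database file lives inside the container, it disappears when the container does, which is exactly why a dedicated external store is the better choice in production.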
Installation
You'll need a Python environment with the following packages installed.
It's best practice to set up a unique environment for each project. You can do this with Anaconda or plain Python:
Python Virtual Environment
python -m venv prefect-webscraper-example
source prefect-webscraper-example/bin/activate
Conda Virtual Environment
conda create -n prefect-webscraper-example python=3.7
conda activate prefect-webscraper-example
Package installation
To install the packages, use pip, as not all of them are available on the conda channels:
pip install -r requirements.txt
Visualization Note
If you want to visualize the DAG, you'll need Graphviz installed. With conda, that's a single command:
conda install graphviz
If you want to use the pure Python approach, refer to the official documentation here:
- https://graphviz.readthedocs.io/en/stable/manual.html
Examples
BeautifulSoup
The example on Prefect's site leverages the requests library along with beautifulsoup4. This pattern works for basic
websites that don't rely heavily on JavaScript to manipulate the DOM.
A working example of using BeautifulSoup to parse a website on a schedule in Prefect Cloud is found in:
- example-bs4.py
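The core parsing pattern in example-bs4.py looks roughly like this (a minimal sketch that parses a static HTML string so it runs without network access; the tag and class names are illustrative, not taken from the example flow):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that requests.get(url).text would return.
html = """
<html><body>
  <h1 class="title">Example Page</h1>
  <ul>
    <li><a href="/a">First link</a></li>
    <li><a href="/b">Second link</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1", class_="title").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(title)  # → Example Page
print(links)  # → ['/a', '/b']
```

In the real flow, the fetch and the parse would each be a Prefect task, so failures and retries are handled per step.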
Selenium
For more modern websites that use a lot of AJAX and JavaScript DOM manipulation, you'll need to execute the JavaScript and parse the page as it would render in a real browser. For this, there are headless versions of popular web browsers that let you parse the rendered page with the same CSS or XPath syntax.
A working example of using Selenium to parse a website on a schedule in Prefect Cloud is found in:
- example-selenium.py
Selenium Drivers
To leverage Selenium on your local machine, you'll need to download the appropriate driver from their website:
- https://selenium-python.readthedocs.io/installation.html#drivers
In this example, we're using the chromedriver located in the same directory as this code.
When deploying to Prefect Cloud, the reference code uses the official Selenium Chrome image as a base, then adds the Prefect Flow code to produce the final image that's orchestrated.
This can be viewed in the Dockerfile file.
Project Layout
| TYPE | OBJECT | DESCRIPTION |
|---|---|---|
| 📁 | docker | Non-source code related files used by the Dockerfile during the build process |
| 📄 | build_docker_base_image.sh | Shell script that builds the base image for the Selenium Chrome driver |
| 📄 | Dockerfile | Dockerfile that builds the final Flow image on top of the Selenium Chrome base image |
| 📄 | example-bs4.py | Example website scraper Prefect Flow ready for Prefect Cloud using BeautifulSoup |
| 📄 | example-selenium.py | Example website scraper Prefect Flow ready for Prefect Cloud using Selenium |
| 📄 | README.md | This file you're reading now |
| 📄 | requirements.txt | Python packages required for local development of Prefect Flows in this repository |