Smarter-Web-Scraping-with-Python
Leverage modern open-source tools to create better web scraping workflows.
Smarter Web Scraping with Python, Selenium, and Llama 2
Generate podcast clips related to the daily top submissions on Hacker News via web scraping with Python & Selenium, generative AI with Ollama and Llama 2, transcript generation with OpenAI Whisper, the iTunes Podcast Search API, and more.
Coming soon
Requirements
- Python 3.10 and up
- A Bright Data Account (includes $25 credit)
- ffmpeg (required for transcribing audio with OpenAI Whisper)
A Proxy-based Web Scraping approach
In this repo, we use a web scraping proxy service from Bright Data. Using a proxy service makes our requests more reliable. You can see the actual code for the Selenium-based remote connection in src/helpers/brightdata.py.
With Remote Proxy
our computer -> request -> proxy -> web server -> proxy -> response -> our computer
Without Remote Proxy
our computer -> request -> web server -> response -> our computer
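The proxy endpoint is typically addressed with a credentialed URL built from the same environment variables used later in this README. Here is a minimal sketch of that idea; the variable values, the `get_proxy_url` helper name, and the exact scheme/port Bright Data expects are all assumptions for illustration, not the repo's actual implementation:

```python
import os

# Hypothetical credentials -- in the real project these come from your .env file
os.environ.setdefault("BRIGHT_DATA_USERNAME", "user123")
os.environ.setdefault("BRIGHT_DATA_PASSWORD", "secret")
os.environ.setdefault("BRIGHT_DATA_HOST", "brd.example-proxy.io:9515")

def get_proxy_url() -> str:
    """Build a credentialed URL for the remote browser endpoint."""
    user = os.environ["BRIGHT_DATA_USERNAME"]
    password = os.environ["BRIGHT_DATA_PASSWORD"]
    host = os.environ["BRIGHT_DATA_HOST"]
    return f"https://{user}:{password}@{host}"

print(get_proxy_url())
```

Routing every request through an endpoint like this is what lets the diagram above insert the proxy between our computer and the web server.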
Usage
# from 'src/2 - Connection Sample.ipynb'
from selenium.webdriver import Remote, ChromeOptions

# import this function
from helpers.brightdata import get_sbr_connection

options = ChromeOptions()
# options.headless = True # old method
options.add_argument("--headless=new")  # new method

sbr_connection = get_sbr_connection()
url = 'https://news.ycombinator.com'
with Remote(sbr_connection, options=options) as driver:
    driver.get(url)
    print(driver.page_source)
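Once `driver.page_source` is in hand, the next step in the workflow is pulling out the submission titles. A minimal sketch using only the standard library's html.parser is below; the `titleline` class name mirrors Hacker News front-page markup at the time of writing (an assumption that may change), and the sample HTML is fabricated for illustration:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect link text found inside <span class="titleline"> elements."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "titleline") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())

# Sample markup mimicking the Hacker News front-page structure
sample = (
    '<span class="titleline"><a href="https://example.com">Show HN: A demo</a></span>'
    '<span class="titleline"><a href="https://example.org">Another story</a></span>'
)
parser = TitleParser()
parser.feed(sample)
print(parser.titles)
```

In practice you would feed `driver.page_source` to the parser instead of the sample string.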
Getting Started
Clone project
mkdir -p ~/dev/smarter-scraping
cd ~/dev/smarter-scraping
git clone https://github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python .
(Optional) Working through the course?
Use the course_start branch with:
mac/linux
git checkout course_start
rm -rf .git
git init
windows
git checkout course_start
Remove-Item .git -Recurse -Force
git init
Create a Python Virtual Environment
cd ~/dev/smarter-scraping # or where you cloned the repo
mac/linux
python3 -m venv venv
windows
c:\Python311\python.exe -m venv venv
Activate the virtual environment
Always activate your environment!
cd ~/dev/smarter-scraping # or where you cloned the repo
mac/linux
source venv/bin/activate
windows
.\venv\Scripts\activate
If done correctly, your command line should start with (venv)
Install requirements
With the virtual environment activated (e.g. (venv)), run:
(venv) python -m pip install pip --upgrade
(venv) python -m pip install -r requirements.txt
Implement Environment Variables with dotenv
mac/linux
cp sample-env-file .env
windows
Copy-Item .env.sample -Destination .env
Be sure to add your Bright Data proxy information:
BRIGHT_DATA_USERNAME
BRIGHT_DATA_PASSWORD
BRIGHT_DATA_HOST
Add the Ollama settings too (for running Llama 2 as an OpenAI drop-in replacement):
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama
OPENAI_COMPLETION_MODEL=llama2
Loading Environment Variables
With code that lives inside the src/ directory, you can import the helpers module to load your environment variables.
We created a simple function to extend the incredible python-decouple package (it's in src/helpers/env.py):
import helpers
MY_VAR = helpers.config('MY_VAR', default="Not set", cast=str)
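Under the hood, a helper like this typically reads the environment with a default fallback and a cast. The following is a minimal sketch of that behavior using only os.environ, not the actual src/helpers/env.py implementation (which builds on python-decouple):

```python
import os

_UNSET = object()  # sentinel so None can be a legitimate default

def config(key, default=_UNSET, cast=str):
    """Read `key` from the environment; fall back to `default`; apply `cast`."""
    value = os.environ.get(key, _UNSET)
    if value is _UNSET:
        if default is _UNSET:
            raise KeyError(f"{key} is not set and no default was given")
        return default
    return cast(value)

os.environ["MY_VAR"] = "42"
print(config("MY_VAR", cast=int))           # cast applies to values read from the env
print(config("MISSING", default="Not set")) # default is returned as-is
```

The cast argument is what lets notebook code treat environment strings as ints, bools, and so on without repeating conversion logic.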
Run Jupyter
Explore the notebooks!
jupyter notebook