simple-reddit-crawler
Lightweight Reddit crawler using Python and MySQL
(͡ ° ͜ʖ͡°) Simple Reddit Crawler
Saving Threads:
Run python reader/reader.py /r/yoursubreddithere
Saving Comments:
Run python reader/reader.py --get-comments
How to build
- git clone this repository.
- Run the create-database.sql script in your MySQL instance.
- Install Python pip using sudo apt-get install python-pip
- Install PyMySQL using sudo pip install PyMySQL
- Open reader/reader.py, search for userAgent = "" and enter a User-Agent there. Skipping this step will cause Reddit to block your requests.
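To illustrate the User-Agent step, here is a minimal sketch of attaching the header to a request with Python 3's standard library. The `USER_AGENT` value and the `build_request` helper are hypothetical and are not taken from reader/reader.py, which may set the header differently.

```python
import urllib.request

# Hypothetical value: replace it with a string that identifies your own crawler.
USER_AGENT = "simple-reddit-crawler/0.1 (by /u/your_username)"


def build_request(url, user_agent=USER_AGENT):
    """Return a urllib Request that carries an explicit User-Agent header."""
    if not user_agent:
        # Mirrors the warning above: an empty User-Agent gets requests blocked.
        raise ValueError("Set a User-Agent before contacting Reddit")
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Building the request does not touch the network; it only prepares the headers.
request = build_request("https://www.reddit.com/r/python/new.json")
print(request.get_header("User-agent"))
```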
How the Crawler works
The crawler runs in 2 steps: threads and comments.
When reading Threads:
- The script reads all the new threads in your subreddit of choice. Reddit's /new listing returns 25 threads per request by default, so only 25 threads are read at a time.
- Then, it inserts all the threads found into the "threads" table.
- By checking the ID of the thread given by Reddit (thread_id column in our "threads" table), we detect if that thread has already been read. Duplicate threads are ignored.
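The thread steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in reader/reader.py: it uses an in-memory SQLite database as a stand-in for MySQL so that it runs without a server, the `title` column and both function names are hypothetical, and the sample listing is hand-written in the shape Reddit's listing JSON uses.

```python
import sqlite3


def extract_threads(listing):
    """Pull (thread_id, title) pairs out of a Reddit listing dict."""
    children = listing.get("data", {}).get("children", [])
    return [(c["data"]["id"], c["data"].get("title", "")) for c in children]


def save_threads(conn, threads):
    """Insert threads, skipping any thread_id already stored.

    Returns the number of rows actually added.
    """
    before = conn.execute("SELECT COUNT(*) FROM threads").fetchone()[0]
    # SQLite spells this INSERT OR IGNORE; the MySQL equivalent is INSERT IGNORE.
    conn.executemany(
        "INSERT OR IGNORE INTO threads (thread_id, title) VALUES (?, ?)",
        threads,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM threads").fetchone()[0]
    return after - before


# Hypothetical schema: only the thread_id column name comes from this README.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (thread_id TEXT PRIMARY KEY, title TEXT)")

sample_listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"id": "abc123", "title": "First"}},
            {"kind": "t3", "data": {"id": "def456", "title": "Second"}},
        ]
    },
}

print(save_threads(conn, extract_threads(sample_listing)))  # 2 new threads
print(save_threads(conn, extract_threads(sample_listing)))  # 0, all duplicates
```

Letting the primary key reject duplicates keeps the check inside the database, so two overlapping reads of /new cannot insert the same thread twice.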
When reading Comments:
- The script loops through all the threads stored in the "threads" table and makes one JSON request for the comments of each thread.
- Then, it inserts all the comments into the "comments" table.
- By checking the ID of the comment given by Reddit (comment_id column in our "comments" table), we detect if that comment has already been read. Duplicate comments are ignored.
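As a rough sketch of the comment steps above: the JSON for a thread is a two-element list, with the thread itself first and the comment listing second, and replies are nested inside each comment. The function names below are hypothetical, and a plain set stands in for the comment_id lookup against the "comments" table.

```python
def flatten_comments(children):
    """Yield (comment_id, body) for every comment in a nested reply tree."""
    for child in children:
        if child.get("kind") != "t1":
            # Skip non-comment entries such as "more" placeholders.
            continue
        data = child["data"]
        yield data["id"], data.get("body", "")
        replies = data.get("replies")
        # Reddit sends an empty string here when a comment has no replies.
        if isinstance(replies, dict):
            yield from flatten_comments(replies["data"]["children"])


def new_comments(thread_json, seen_ids):
    """Return comments whose IDs are not in seen_ids, and record them as seen."""
    fresh = []
    for comment_id, body in flatten_comments(thread_json[1]["data"]["children"]):
        if comment_id not in seen_ids:
            seen_ids.add(comment_id)
            fresh.append((comment_id, body))
    return fresh


# Hand-written sample in the shape of a thread's comment JSON.
sample_thread = [
    {"kind": "Listing", "data": {"children": [{"kind": "t3", "data": {"id": "abc123"}}]}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"id": "c1", "body": "top", "replies": {
            "kind": "Listing", "data": {"children": [
                {"kind": "t1", "data": {"id": "c2", "body": "reply", "replies": ""}},
            ]},
        }}},
        {"kind": "more", "data": {"children": ["c9"]}},
    ]}},
]

seen = {"c1"}  # pretend c1 was stored on an earlier run
print(new_comments(sample_thread, seen))  # [('c2', 'reply')]
```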
Important:
Since Reddit limits the number of JSON requests to one every two seconds, reading comments takes longer and longer as more threads are posted. Eventually a comment pass takes so long that more than 25 new threads are posted in the meantime, and some of them are lost.
To avoid this, first read all the threads over a certain period of time, and read their comments only after all of those threads are in the database.
To do that, run python reader/reader.py /r/yoursubreddithere to store only the new threads. Leave this script running for as long as you need.
Then, stop it and run python reader/reader.py --get-comments to store only the comments from the threads read above. Note that this script will run repeatedly to get new comments, so stop its execution when enough comments have been captured.
You can check the result of each run in the logs table.
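To make the timing concern concrete, here is a minimal sketch of spacing requests two seconds apart and estimating how long one comment pass takes. The `Throttle` class and `estimated_pass_seconds` are hypothetical helpers, not part of reader/reader.py, and the estimate ignores network and database time.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def estimated_pass_seconds(thread_count, min_interval=2.0):
    """Rough lower bound for one comment pass: one request per stored thread."""
    return thread_count * min_interval


# With 500 stored threads, one pass needs roughly 1000 seconds of waiting alone.
print(estimated_pass_seconds(500))
```

Calling `wait()` before each JSON request would keep the crawler within the limit, and the estimate shows why the comment pass grows with the size of the "threads" table.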