the-waterloo-blogger
First attempt at scraping all the blogs
I saw that you had future plans for scraping the blogs, and decided to take a stab at the problem. As you mentioned, the blogs come in a lot of different formats, so it'd be difficult to correctly extract the title and body from all of them to form an aggregate.
My approach is to extract only the dates (e.g. "August 10, 2017") from each blog's homepage, which is fairly easy. Then we can sort the blogs by the date of their most recent post. I coded a quick prototype of this idea and was able to find the 5 most recently updated blogs (excluding Medium blogs, which are blocked in my country):
- http://ivebeenbit.ca/
- http://waterloowhynot.tumblr.com/
- https://anzoteh96.wordpress.com
- https://amosunov.wordpress.com/category/blog/
- https://adventuresthatlieahead.wordpress.com/
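For reference, here is a minimal sketch of the date-extraction idea described above. It is not the actual prototype: the names `latest_post_date` and `rank_blogs` are hypothetical, it assumes dates are written in the English "Month D, YYYY" form, and it assumes the homepage HTML has already been downloaded into a dict keyed by URL.

```python
import re
from datetime import date
from typing import Dict, List, Optional, Tuple

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Matches dates such as "August 10, 2017" (assumed format; other styles are ignored).
DATE_RE = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b")


def latest_post_date(html: str) -> Optional[date]:
    """Return the most recent date found in the page text, or None if there is none."""
    found = []
    for month, day, year in DATE_RE.findall(html):
        try:
            found.append(date(int(year), MONTH_NAMES.index(month) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 2017".
            continue
    return max(found, default=None)


def rank_blogs(pages: Dict[str, str]) -> List[Tuple[str, date]]:
    """Sort blogs by their latest date, newest first; pages with no date are dropped."""
    dated = []
    for url, html in pages.items():
        latest = latest_post_date(html)
        if latest is not None:
            dated.append((url, latest))
    return sorted(dated, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_blogs(pages)[:5]` would then give the five most recently updated blogs. One caveat with this sketch: any date on the page counts, so a homepage that mentions a future event or shows a copyright-style date could be ranked too high.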
LMK if you want to work on this together!
@luckytoilet That approach is quite clever. I hadn't thought of looking for dates, which probably gets you the right posts >95% of the time. (I was thinking of looking for repeated DOM structure, which would be more complicated, and maybe even building a sort of "HTML regex template" DSL.)
I do want to work on this, but I currently also want to wrap up my projects from Recurse Center and avoid taking on too many things =/
I got a basic version set up here, do you want to link to it on the main page?