the-waterloo-blogger

First attempt at scraping all the blogs

Open lucky-bai opened this issue 8 years ago • 2 comments

I saw that you had future plans for scraping the blogs, and decided to take a stab at the problem. As you mentioned, the blogs have a lot of different formats so it'd be difficult to correctly extract the title / body for all of them to form an aggregate.

My approach is to extract only the dates (e.g. "August 10, 2017") from each blog's homepage, which is fairly easy. Then we can sort the blogs by the date of their most recent post. I coded a quick prototype of this idea and was able to find the 5 most recently updated blogs (excluding Medium blogs, which are blocked in my country):

  1. http://ivebeenbit.ca/
  2. http://waterloowhynot.tumblr.com/
  3. https://anzoteh96.wordpress.com
  4. https://amosunov.wordpress.com/category/blog/
  5. https://adventuresthatlieahead.wordpress.com/
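For readers following along, the date-extraction idea described above can be sketched roughly as follows. This is not the prototype's actual code, which isn't shown in this thread; the function names (`extract_dates`, `rank_by_latest_post`), the single "Month D, YYYY" pattern, and the choice to pass in already-downloaded HTML are all assumptions made for illustration.

```python
import re
from datetime import date

# Assumption: only English "Month D, YYYY" dates are recognised.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\b")


def extract_dates(html):
    """Return every valid 'Month D, YYYY' date found in the raw HTML."""
    found = []
    for month, day, year in DATE_RE.findall(html):
        try:
            found.append(date(int(year), MONTHS.index(month) + 1, int(day)))
        except ValueError:
            # Skip impossible dates such as "February 30, 2017".
            continue
    return found


def rank_by_latest_post(pages, today=None):
    """Sort blogs by their most recent date, newest first.

    `pages` maps a blog URL to its homepage HTML (fetched elsewhere).
    Dates after `today` are ignored, since they are unlikely to be post dates.
    Blogs with no recognisable date are left out of the result.
    """
    today = today or date.today()
    latest = {}
    for url, html in pages.items():
        dates = [d for d in extract_dates(html) if d <= today]
        if dates:
            latest[url] = max(dates)
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_by_latest_post(pages)[:5]` would then give the five most recently updated blogs among those where a date was found. Blogs that print dates in another format (say "10 Aug 2017" or "2017-08-10") would need extra patterns.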

LMK if you want to work on this together!

lucky-bai avatar Aug 16 '17 13:08 lucky-bai

@luckytoilet That approach is quite clever. I hadn't thought of looking for dates, which probably gets you the right posts >95% of the time. (I was thinking of looking for repeated DOM structure, which would be more complicated, maybe even making a sort of "HTML regex template" DSL).
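To make the "repeated DOM structure" alternative a bit more concrete, here is a rough sketch of one possible starting point: count how often each tag path occurs on a page, on the assumption that post entries show up as a path repeated several times. It does not attempt the template DSL mentioned above, and the names `PathCounter` and `repeated_paths` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCounter(HTMLParser):
    """Counts occurrences of each tag path, e.g. 'html/body/div/article'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: unwind to the matching open tag, if any.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def repeated_paths(html, min_count=2):
    """Return (path, count) pairs seen at least `min_count` times."""
    parser = PathCounter()
    parser.feed(html)
    parser.close()
    return [(p, n) for p, n in parser.counts.most_common() if n >= min_count]
```

On a typical blog homepage the most repeated paths would likely include navigation links and sidebar items as well as posts, so some further filtering would be needed, which is part of why this route is more complicated than looking for dates.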

I do want to work on this, but I currently also want to wrap up my projects from Recurse Center and avoid taking on too many things =/

rudi-c avatar Aug 17 '17 03:08 rudi-c

I got a basic version set up here, do you want to link to it on the main page?

lucky-bai avatar Aug 25 '17 21:08 lucky-bai