
Update scraping to support HN's layout changes

Open cooperra opened this issue 5 years ago • 1 comment

Three parts to this PR:

  1. I froze the requirements versions because I wasn't able to run the project otherwise. I'm sure some packages could be updated, but I didn't look into it.
  2. Scraping changes due to HN updates
    • The comment span is now a div
    • The post selector had to be constrained (class comtr) because it was picking up table rows that contained pagination links.
  3. Pagination support
    • The scraper now traverses "More" links on a page if there are any and continues scraping until it reaches the end of the comments.
    • The new logic is in extract_jobs_from_thread(s)
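For readers who want a feel for the approach, here is a minimal standard-library sketch of the two ideas above: only rows with class `comtr` are treated as posts, and a "More" link is followed until none remains. It is not the code in this PR. `ThreadPageParser`, `scrape_thread`, the `fetch` callable, and the `max_pages` cap are hypothetical names, and the `div.comment` and `a.morelink` selectors are assumptions about HN's markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ThreadPageParser(HTMLParser):
    """Collects text of div.comment inside tr.comtr rows, plus the "More" href."""

    def __init__(self):
        super().__init__()
        self.comments = []      # one cleaned string per comment
        self.more_href = None   # href of the pagination link, if any
        self._rows = []         # stack of open <tr>: True if it has class comtr
        self._div_depth = 0     # > 0 while inside a div.comment
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr":
            self._rows.append("comtr" in classes)
        elif tag == "div":
            if self._div_depth:
                self._div_depth += 1          # nested div inside a comment
            elif any(self._rows) and "comment" in classes:
                self._div_depth = 1           # comment body starts here
                self._chunks = []
        elif tag == "a" and "morelink" in classes:
            # Assumed class name for the "More" pagination link.
            self.more_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
            if self._div_depth == 0:
                # Collapse runs of whitespace into single spaces.
                self.comments.append(" ".join("".join(self._chunks).split()))
        elif tag == "tr" and self._rows:
            self._rows.pop()

    def handle_data(self, data):
        if self._div_depth:
            self._chunks.append(data)


def scrape_thread(first_url, fetch, max_pages=50):
    """Follow "More" links from first_url; fetch(url) must return HTML text."""
    comments, url, seen = [], first_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = ThreadPageParser()
        parser.feed(fetch(url))
        comments.extend(parser.comments)
        # Relative hrefs are resolved against the page they came from.
        url = urljoin(url, parser.more_href) if parser.more_href else None
    return comments
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library and makes it easy to try against saved pages. The `seen` set and the `max_pages` cap guard against a pagination loop.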

Disclaimer: When I tested this, it eventually timed out after successfully processing 7 months of posts. Perhaps it hit a rate limit, or perhaps it was just bad luck. You might have to load the production database one month at a time to catch up to the present month.

Sidenote: I didn't know this was broken until I recently needed the "Who's hiring?" thread again. Thanks for making it! 🙂

fixes #4

cooperra avatar Oct 09 '19 20:10 cooperra

@cooperra niiiice thanks for doing this! 🙏 - it's good to see this super-old side project getting some love :)

I'll try to merge and get the site back up later this week and I'll let you know.

oilnam avatar Oct 15 '19 10:10 oilnam