lc-webscraping
Rework of lesson available for mining into the original
Hi folks,
Last year, I reworked this lesson (https://github.com/resbazSQL/lc-webscraping) to integrate it with the SWC capstone "excel to database" (https://github.com/resbazSQL/capstone-novice-spreadsheet-biblio). My pull request back was rightly rejected for being far too large. "Break lesson into commits" has sat on my todo list for the past year, so in the meantime it's worth noting that the reworked (and taught) lesson is available for other folks (including those working on instructor checkouts) to mine text from.
Here is an incomplete listing of changes:
- Reworked the lessons to use perma.cc data sources, so that the lesson doesn't break when the parent pages change
- Tried to incorporate repeating themes into the lesson flow (browser extension, then console, then Scrapy) to reinforce learning
- Reduced the emphasis on hand-crafting XPath expressions in the browser console
- Made the Scrapy output flow into the excel-to-database lesson
- Made the pages refer to multiple countries, to reduce the single-country political focus
I hope it's useful to folks looking for text they can reuse to address issues they find. I'm unlikely to have time in the next few months to break my edits into a series of commits for proper staging back into main.
@Denubis we still need another Maintainer on this lesson. Would you be interested once you finish your training in the next 6 months? @JoshuaDull and @timtomch are the current Maintainers. I know that @JoshuaDull won't be able to review and update the lesson for at least another 3-4 months. Otherwise, I'll let them respond to your changes.
Yeah, I'd be delighted.
Random aside: at Resbaz this year, one of the other instructors said that he was using my edit for multiple webscraping sessions and that it was running well. Is anyone interested in starting discussions about a rework of this lesson, perhaps incorporating content I wrote last year, or maybe splitting the Python/XPath parts?