recipe-scrapers
Scraper Development Guide
Hi guys
One of the suggestions for improving the developer experience in #617 is writing some developer guidance documentation. It's also been mentioned in a few issues and PRs lately, so I thought I would have a go at starting something.
I've come up with a rough outline of what the docs could cover:
- A step by step guide to developing a new scraper. This would start from identifying a website, and cover generating the scraper and tests, adding functionality to the scraper, adding functionality to the test cases. This would be the main piece of documentation, and it would then link out to some more in depth articles to cover the following specific topics:
- A more detailed definition of what the `Scraper` methods are and what they should return (in terms of datatypes and content), and which `Scraper` methods are 'mandatory' (e.g. title, ingredients, instructions ...) and which are more 'optional' (e.g. ingredient groups, ratings, reviews ...).
- A more detailed guide on scraping from the HTML. I see this being a bit like a cookbook of common patterns and best practice.
- A detailed guide for adding ingredient groups. This would effectively take the guidance I wrote in #799 and tidy it up.
- A more detailed guide on debugging a scraper during development.
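To make the 'mandatory' versus 'optional' idea above a bit more concrete, here is a minimal sketch of how such a contract could be written down. Everything in it (`SketchScraper`, `IngredientGroup`, `ExampleSiteScraper`, the return types and the fallback behaviour) is a hypothetical illustration for discussion, not the library's actual base class or API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IngredientGroup:
    # Hypothetical container: an optional heading (e.g. "For the sauce")
    # plus the ingredients listed under it.
    purpose: Optional[str]
    ingredients: List[str]


class SketchScraper(ABC):
    """Hypothetical base class for illustration only."""

    # 'Mandatory' methods: a subclass cannot be instantiated without them.
    @abstractmethod
    def title(self) -> str: ...

    @abstractmethod
    def ingredients(self) -> List[str]: ...

    @abstractmethod
    def instructions(self) -> str: ...

    # 'Optional' methods: these have defaults, so a scraper only overrides
    # them when the website actually provides the data.
    def ingredient_groups(self) -> List[IngredientGroup]:
        # Assumed fallback: one unnamed group holding every ingredient.
        return [IngredientGroup(purpose=None, ingredients=self.ingredients())]

    def ratings(self) -> Optional[float]:
        # Assumed convention: None means "no rating available".
        return None


class ExampleSiteScraper(SketchScraper):
    # Hard-coded values stand in for data scraped from a page.
    def title(self) -> str:
        return "Pancakes"

    def ingredients(self) -> List[str]:
        return ["200g flour", "2 eggs", "300ml milk"]

    def instructions(self) -> str:
        return "Whisk everything together.\nFry in a hot pan."


class IncompleteScraper(SketchScraper):
    # Missing ingredients() and instructions(), so it cannot be instantiated.
    def title(self) -> str:
        return "Broken"
```

Using abstract methods for the mandatory part means a scraper that forgets one fails as soon as it is instantiated, while the optional methods degrade to a sensible default. Whether the real guide should describe the contract this way is an open question.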
A couple of questions I have:
- What format should this take?
  a. GitHub wiki?
  b. Markdown files in a docs folder?
  c. Sphinx (or similar) generated pages?
- Are there any topics people would like to see covered that I haven't mentioned above?
Progress
- [x] Step by step guide for developing scraper (#862)
- [x] Detailed guide: scraper functions (#862)
- [x] Detailed guide: ingredient groups (#862)
- [x] Detailed guide: HTML scraping (#862)
- [ ] Detailed guide: debugging
Contributions to any of the currently unwritten guides, or any additional documentation, are welcome.
What format should this take?
I'd vote for markdown files within the repository, with a wiki as my second preference.
Reasoning: markdown is fairly straightforward and readable with or without supporting tooling, and GitHub previews it automatically, meaning that casual visitors to our repository could read it effectively too. It's also available while working with the code (whether in an IDE, online, or command-line), a benefit over the web-based wiki. Finally: some documentation changes are closely related to code changes, and the ability to include both in the same pull request / commit (when beneficial) could be useful.
(also: thanks for getting this discussion going!)
Thanks @jayaddison.
I'm glad you've voted for markdown files, as that was my preference too. I've created a draft PR #862 with a starting point and I'll continue adding to it as I get the chance.
I second the markdown files, yep. I feel like mkdocs + material theme seems to be the pick nowadays in the Python community. I'd vote for that specific combo with the search plugin included. Sounds like a nice starting point.
@strangetom maybe worth updating the issue description to use a Markdown checklist, and ticking off the items completed? (most of them :)) I'm thinking it might help some other contributor to see where they can help.
Updated :)