recipe-scrapers
Support scraping h-recipe data
Rather than supporting only a predefined list of websites for which custom scraping has been implemented, it may be useful to also check, as a backup, whether h-recipe data is available. This is a microformat that can be embedded in HTML and that defines recipes in a structured manner.
If added, the library would:
- Check if a URL is from a site with a custom scraper. If it is, use that scraper.
- Otherwise, check if the page contains h-recipe data. If it does, scrape that data.
- Otherwise, error out saying that the site is unsupported.
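The three steps above could be sketched roughly as follows. This is only an illustration, not how recipe-scrapers is actually structured: `CUSTOM_SCRAPERS`, `UnsupportedSite`, `find_h_recipe`, `parse_microformats` and `scrape` are hypothetical names. The only real library call is `mf2py.parse(doc=..., url=...)`, which returns a dict whose `"items"` list holds the parsed microformats.

```python
from typing import Any, Callable, Dict, List, Optional
from urllib.parse import urlparse

# Hypothetical registry: hostname -> custom scraper callable (html, url) -> data.
CUSTOM_SCRAPERS: Dict[str, Callable[[str, str], Dict[str, Any]]] = {}


class UnsupportedSite(Exception):
    """Raised when there is neither a custom scraper nor h-recipe data."""


def find_h_recipe(items: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first item typed h-recipe, also searching nested children."""
    for item in items:
        if "h-recipe" in item.get("type", []):
            return item
        found = find_h_recipe(item.get("children", []))
        if found is not None:
            return found
    return None


def parse_microformats(html: str, url: str) -> Dict[str, Any]:
    # Imported lazily so the rest of the sketch works without mf2py installed.
    import mf2py

    return mf2py.parse(doc=html, url=url)


def scrape(html: str, url: str, parse=parse_microformats) -> Dict[str, Any]:
    # Step 1: prefer a custom scraper when one is registered for the host.
    host = urlparse(url).netloc
    if host in CUSTOM_SCRAPERS:
        return CUSTOM_SCRAPERS[host](html, url)

    # Step 2: fall back to h-recipe microformat data embedded in the page.
    recipe = find_h_recipe(parse(html, url).get("items", []))
    if recipe is not None:
        return recipe.get("properties", {})

    # Step 3: nothing usable was found.
    raise UnsupportedSite(f"No custom scraper or h-recipe data for {url}")
```

Passing the parser in as an argument keeps the fallback logic testable without network access or the parsing dependency.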
Looks like mf2py is an official package for parsing microformat data. extruct seems to already incorporate mf2py within it. However, I'm not familiar enough with the package or this repo to say whether it's being used in a way that incorporates it.
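For reference, a minimal sketch of asking extruct for microformat data only. It assumes extruct's `extract()` accepts `syntaxes=["microformat"]` and returns the parsed items under the `"microformat"` key. `microformat_items` and `has_h_recipe` are hypothetical helper names, and whether this repo calls extruct this way is not something the sketch confirms.

```python
from typing import Any, Dict, List


def microformat_items(html: str, url: str) -> List[Dict[str, Any]]:
    # Imported lazily so the helper below can be used without extruct installed.
    import extruct

    # Restrict extraction to the microformat syntax (handled via mf2py).
    data = extruct.extract(html, base_url=url, syntaxes=["microformat"])
    return data.get("microformat", [])


def has_h_recipe(items: List[Dict[str, Any]]) -> bool:
    """True if any top-level item is typed h-recipe."""
    return any("h-recipe" in item.get("type", []) for item in items)
```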
Additionally, there would need to be a fallback in the case of #150.
FWIW, #150 needn't be considered a blocker for adding h-recipe support. It'd be nice to find a pure Python approach (I think that'd reduce build times and arguably be safer than including C dependencies), but that could happen in parallel.
Sorta-related: if extruct could be migrated to pure Python then that might solve multiple problems. After some investigation, html5lib seems like a potential substitute for lxml to achieve that, but some performance work might be required before it's suitable for use by extruct.
Are there any websites that use h-recipe? I think that the ingredient markup in h-recipe is better than schema/Recipe.