recipe-scrapers

Support scraping h-recipe data

Open alilleybrinker opened this issue 4 years ago • 4 comments

Rather than supporting only a pre-defined list of websites for which custom scraping has been implemented, it may be useful to also check, as a backup, whether h-recipe data is available. h-recipe is a microformat that can be embedded in HTML to define recipes in a structured manner.

If added, the library would:

  1. Check if a URL is from a site with a custom scraper. If it is, use that scraper.
  2. Otherwise, check if the page contains h-recipe data. If it does, scrape that data.
  3. Otherwise, error out saying that the site is unsupported.
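The three steps above can be sketched as follows. This is only a rough illustration using the standard library, not how recipe-scrapers is actually organised: `CUSTOM_SCRAPERS`, `UnsupportedSite`, `has_h_recipe` and `scrape` are hypothetical names, and the h-recipe step only detects the `h-recipe` class rather than parsing its properties.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _HRecipeDetector(HTMLParser):
    """Sets `found` when any element carries the `h-recipe` class."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. a bare `class`), hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if "h-recipe" in classes:
            self.found = True


def has_h_recipe(html):
    """Return True if the page appears to contain h-recipe markup."""
    detector = _HRecipeDetector()
    detector.feed(html)
    return detector.found


# Hypothetical registry (an assumption for this sketch): host -> scraper callable.
CUSTOM_SCRAPERS = {
    "example.com": lambda html: {"source": "custom"},
}


class UnsupportedSite(Exception):
    """Raised when neither a custom scraper nor h-recipe data is available."""


def scrape(url, html):
    host = urlparse(url).hostname or ""

    # 1. Prefer a site-specific scraper when one is registered.
    scraper = CUSTOM_SCRAPERS.get(host)
    if scraper is not None:
        return scraper(html)

    # 2. Otherwise fall back to h-recipe data, if the page has any.
    if has_h_recipe(html):
        return {"source": "h-recipe"}

    # 3. Otherwise report the site as unsupported.
    raise UnsupportedSite(f"No custom scraper or h-recipe data for {url}")
```

Keeping the custom scraper first means existing site-specific behaviour would not change; the microformat check only runs for sites that are currently rejected.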

alilleybrinker • Apr 16 '20 17:04

Looks like mf2py is an official package for parsing microformat data.
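For illustration, here is a rough sketch of pulling h-recipe items out of a microformats2 parse result. It assumes the parser (for example `mf2py.parse(doc=html)`, which is not called here) returns the canonical microformats2 JSON shape with a top-level `items` list. The helpers `find_h_recipes` and `first` and the `sample` data are made up for this example.

```python
def find_h_recipes(parsed):
    """Return every h-recipe item in a microformats2 parse result.

    `parsed` is assumed to be the canonical mf2 JSON dict:
    {"items": [{"type": [...], "properties": {...}, "children": [...]}], ...}
    """
    found = []

    def walk(item):
        if "h-recipe" in item.get("type", []):
            found.append(item)
        # Nested microformats can appear as children...
        for child in item.get("children", []):
            walk(child)
        # ...or as property values that are themselves items.
        for values in item.get("properties", {}).values():
            for value in values:
                if isinstance(value, dict) and "type" in value:
                    walk(value)

    for top in parsed.get("items", []):
        walk(top)
    return found


def first(item, prop):
    """Return the first value of a property, or None if it is absent."""
    values = item.get("properties", {}).get(prop, [])
    return values[0] if values else None


# Hand-written sample in the assumed mf2 JSON shape (not real site data).
sample = {
    "items": [
        {
            "type": ["h-entry"],
            "properties": {"name": ["Post"]},
            "children": [
                {
                    "type": ["h-recipe"],
                    "properties": {
                        "name": ["Toast"],
                        "ingredient": ["1 slice bread", "butter"],
                    },
                }
            ],
        }
    ],
    "rels": {},
    "rel-urls": {},
}

recipes = find_h_recipes(sample)
```

With the sample above, `first(recipes[0], "name")` gives the recipe title and `recipes[0]["properties"]["ingredient"]` gives the ingredient strings.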

alilleybrinker • Apr 16 '20 17:04

extruct seems to already incorporate mf2py within it. However, I'm not familiar with the package or this repo enough to say whether it's being used in a way that incorporates it.

Additionally, there would need to be a fallback in the case of #150.

bfcarpio • Jul 13 '20 19:07

FWIW, #150 needn't be considered a blocker for adding h-recipe support. It'd be nice to find a pure Python approach (I think that'd reduce build times and arguably be safer than including C dependencies) but that could happen in parallel.

Sorta-related: if extruct could be migrated to pure Python then that might solve multiple problems. After some investigation, html5lib seems like a potential substitute for lxml to achieve that, but some performance work might be required before it's suitable for use by extruct.

jayaddison • Dec 22 '20 13:12

Are there any websites that use h-recipe? I think that the ingredient markup in h-recipe is better than schema/Recipe.

micahcochran • Jan 07 '21 02:01