recipes Avoid off-by-one when scraping 'servings

Avoid off-by-one when scraping 'servings_text'

Open jmerdich opened this issue 1 year ago • 4 comments

This was showing something like servings=4 and servings_text="['4']" for all recipes I imported. Use the first item in the list instead of the second-- my list only has one entry which threw an exception, dropped it, and stringified the list. This matches the behavior of servings.

Oct 13 '24 16:10 jmerdich

All committers have signed the CLA.

Oct 13 '24 16:10 CLAassistant

Interesting, thanks for the PR. I guess that using 1 as an index was done on purpose at some point because the person implementing it (probably smilerz or me) had test data that was like ["1","pcs"].

What do you think about looping the list for an entry and trying do some regex matching to find the best fit (contains number/is only number/ does not contain a number) to potentially improve the results?

Oct 14 '24 06:10 vabene1111

I'm not super keen on writing a heuristic, since I don't have a lot of experience on real-world data here and I'm not entirely sure what the intent of the field is (original text? units?). I also may need to change the numeric equivalent depending on the heuristic (which just takes the first item today).

Here are a few more options, do any sound good?

Last item in list (maybe what the last one meant?)
String of all items in list (" ".join(...))
String of all items in list, deduplicated.

Oct 14 '24 20:10 jmerdich

Interesting, I like your second option combined with removing the first item that looks like a number, do you think that makes sense?

remove the first item in the list that regex matches a number (in any style with , and . as seperators)
join the rest of the list into one string with spaces as delimiter

Oct 15 '24 05:10 vabene1111

recipes recipes copied to clipboard

Avoid off-by-one when scraping 'servings_text'

recipes
recipes copied to clipboard