recipes icon indicating copy to clipboard operation
recipes copied to clipboard

Avoid off-by-one when scraping 'servings_text'

Open jmerdich opened this issue 1 year ago • 4 comments

This was showing something like servings=4 and servings_text="['4']" for all recipes I imported. Use the first item in the list instead of the second-- my list only has one entry which threw an exception, dropped it, and stringified the list. This matches the behavior of servings.

jmerdich avatar Oct 13 '24 16:10 jmerdich

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 13 '24 16:10 CLAassistant

Interesting, thanks for the PR. I guess that using 1 as an index was done on purpose at some point because the person implementing it (probably smilerz or me) had test data that was like ["1","pcs"].

What do you think about looping the list for an entry and trying do some regex matching to find the best fit (contains number/is only number/ does not contain a number) to potentially improve the results?

vabene1111 avatar Oct 14 '24 06:10 vabene1111

I'm not super keen on writing a heuristic, since I don't have a lot of experience on real-world data here and I'm not entirely sure what the intent of the field is (original text? units?). I also may need to change the numeric equivalent depending on the heuristic (which just takes the first item today).

Here are a few more options, do any sound good?

  • Last item in list (maybe what the last one meant?)
  • String of all items in list (" ".join(...))
  • String of all items in list, deduplicated.

jmerdich avatar Oct 14 '24 20:10 jmerdich

Interesting, I like your second option combined with removing the first item that looks like a number, do you think that makes sense?

So

  1. remove the first item in the list that regex matches a number (in any style with , and . as seperators)
  2. join the rest of the list into one string with spaces as delimiter

vabene1111 avatar Oct 15 '24 05:10 vabene1111