Avoid off-by-one when scraping 'servings_text'
This was showing something like servings=4 and servings_text="['4']" for all recipes I imported. Use the first item in the list instead of the second: my list has only one entry, so indexing the second item threw an exception, the exception was dropped, and the whole list was stringified instead. This matches the behavior of servings.
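A minimal sketch of the failure mode described above, not the project's actual code. The function names `servings_text_old` and `servings_text_new` are hypothetical, and the fallback to `str()` is assumed from the observed output:

```python
def servings_text_old(yields):
    """Assumed old behavior: take index 1, stringify the list on failure."""
    try:
        return yields[1]
    except IndexError:
        # The exception is swallowed and the whole list is stringified.
        return str(yields)


def servings_text_new(yields):
    """Proposed behavior: take index 0, like the numeric servings value."""
    try:
        return yields[0]
    except IndexError:
        # Empty list: nothing to show.
        return ""


print(servings_text_old(["4"]))  # the single-entry case from this PR
print(servings_text_new(["4"]))
```

With a two-entry list such as ["1", "pcs"], the old version returns the second entry, which may be why index 1 looked correct on earlier test data.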
Interesting, thanks for the PR. I guess that using 1 as an index was done on purpose at some point because the person implementing it (probably smilerz or me) had test data that was like ["1","pcs"].
What do you think about looping over the list and trying to do some regex matching on each entry to find the best fit (contains a number / is only a number / does not contain a number) to potentially improve the results?
I'm not super keen on writing a heuristic, since I don't have a lot of experience on real-world data here and I'm not entirely sure what the intent of the field is (original text? units?). I also may need to change the numeric equivalent depending on the heuristic (which just takes the first item today).
Here are a few more options, do any sound good?
- Last item in list (maybe what the last one meant?)
- String of all items in list (" ".join(...))
- String of all items in list, deduplicated.
Interesting, I like your second option combined with removing the first item that looks like a number. Do you think that makes sense?
So
- remove the first item in the list that regex matches a number (in any style, with , and . as separators)
- join the rest of the list into one string with spaces as delimiter
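The two steps above could be sketched roughly like this. The function name `servings_text_from_list` and the exact regex are assumptions for illustration; the pattern only treats an entry as a number when the whole entry is digits with optional , or . separators:

```python
import re

# Hypothetical pattern: an entry that is only a number, e.g. "4", "1,5", "1.000,50".
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")


def servings_text_from_list(items):
    """Drop the first number-like entry, then join the rest with spaces."""
    remaining = [str(item).strip() for item in items]
    for i, item in enumerate(remaining):
        if NUMBER_RE.match(item):
            # Remove only the first match; later numbers are kept.
            del remaining[i]
            break
    return " ".join(remaining)


print(servings_text_from_list(["1", "pcs"]))
print(servings_text_from_list(["4"]))
```

Note that a single-entry list like ["4"] yields an empty string under this rule, and an entry such as "4 servings" is left untouched because it is not purely numeric. Whether those are the desired results is still open.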