
Scraping the allrecipes website: spider response errors

Open schnapi opened this issue 8 years ago • 4 comments

I would like to know why I am getting a lot of errors like the one below when I try to scrape allrecipes.com.

Thanks!

2017-10-27 13:31:38 [allrecipes] DEBUG: No item received for http://allrecipes.com/recipe/16348/baked-pork-chops-i/
2017-10-27 13:31:38 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/16348/baked-pork-chops-i/> (referer: http://allrecipes.com/recipes/?page=2)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 33, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str
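The traceback suggests that the parsed JSON (data) is a list rather than a dict, so indexing it with the string key 'items' raises the TypeError. A minimal defensive sketch, assuming the spider loads the page's embedded JSON into data; the function name and the exact response shapes are hypothetical, inferred only from the traceback:

import json

def extract_items(raw_json):
    # Parse the embedded JSON from the recipe page. Assumes the page
    # sometimes returns a dict with an 'items' key and sometimes a bare
    # list (hypothetical shape, inferred from the traceback above).
    data = json.loads(raw_json)
    if isinstance(data, dict):
        items = data.get('items', [])
    else:
        items = data  # already a list of items
    if len(items) == 0:
        return None  # corresponds to the "No item received" debug message
    return items

Guarding the type before indexing avoids the TypeError regardless of which shape the site returns.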

schnapi avatar Oct 27 '17 11:10 schnapi

2017-10-27 13:36:31 [scrapy.extensions.logstats] INFO: Crawled 382 pages (at 86 pages/min), scraped 0 items (at 0 items/min)
2017-10-27 13:36:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=33> (referer: None)
2017-10-27 13:36:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=34> (referer: None)
2017-10-27 13:36:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=35> (referer: None)
2017-10-27 13:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=36> (referer: None)
2017-10-27 13:36:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=37> (referer: None)
2017-10-27 13:36:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=38> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=40> (referer: None)
2017-10-27 13:36:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=39> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=41> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=42> (referer: None)
2017-10-27 13:36:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=43> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=45> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=44> (referer: None)
2017-10-27 13:36:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=48> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=46> (referer: None)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
2017-10-27 13:36:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://allrecipes.com/recipes/?page=47> (referer: None)
2017-10-27 13:36:58 [scrapy.core.scraper] ERROR: Spider error processing <GET http://allrecipes.com/recipe/13423/my-chili/> (referer: http://allrecipes.com/recipes/?page=34)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/mnt/c/Users/Sandi/Desktop/food2vec-master/food2vec-master/dat/RecipesScraper/RecipesScraper/spiders/allrecipes_spider.py", line 31, in parse_item
    if len(data['items']) == 0:
TypeError: list indices must be integers, not str

schnapi avatar Oct 27 '17 11:10 schnapi

Do you still have the allrecipes data file? Also, the allrecipes website blocked my IP. Do you have any suggestions for how to handle this problem? Thank you!

schnapi avatar Oct 27 '17 11:10 schnapi

Thanks @schnapi -- cc'ing @brandonmburroughs here too in case he's interested (he wrote a great scraper for it).

Let me know if the allrecipes file here works for you:

https://github.com/altosaar/food2vec/tree/master/dat

There is also a preprocessing script here: https://github.com/altosaar/food2vec/blob/master/src/process_scraped_data.py

jaanli avatar Oct 28 '17 13:10 jaanli

I'm facing a similar issue here. I wrote a scraper for allrecipes and initially I was able to get data from the website, but they have probably blacklisted my IP. Does anyone know a good workaround?
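One common mitigation is to throttle the crawl so the site is less likely to block the client. A minimal sketch of Scrapy settings in the project's settings.py; the values are illustrative, not tuned for allrecipes.com, and the user-agent string is a placeholder:

# Illustrative throttling settings for a Scrapy project (settings.py)
DOWNLOAD_DELAY = 2.0                 # pause between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # keep parallelism low
AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
HTTPCACHE_ENABLED = True             # cache responses so re-runs don't re-hit the site
ROBOTSTXT_OBEY = True
USER_AGENT = 'food2vec research crawler (contact: your-email@example.com)'  # placeholder contact

Rotating proxies or user agents is another option, but slowing the crawl down and enabling the cache is usually the simplest first step once an IP has been flagged.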

aayushworkiitr avatar May 12 '18 19:05 aayushworkiitr