Add scraper celery task to update Durham images

Open copelco opened this issue 10 years ago • 4 comments

copelco avatar May 08 '14 01:05 copelco

Not sure how to approach this one... I used a site scraper to look up the property_id from the address. I don't mind sharing that code so we could incorporate it.

However:

  • Running it for the already processed establishments is pointless, as their addresses usually don't change. Thus the property_id stays the same, and the image won't change. We use the most recent image that Durham County has for a given property_id (we actually hotlink to their site...)
  • We could run it asynchronously on import (even through celery) for new establishments and the ones that moved.

tamcap avatar May 22 '14 00:05 tamcap

I think one of the problems we have is that some of the images are no longer found. If we were to scrape the images on a regular basis, every week or every other week, we could update which properties have an image and which ones don't, and maybe even get images for establishments that did not have one previously.

vrocha avatar May 22 '14 14:05 vrocha

OK, to summarize this and #112:

  • we need to move the picture URL logic out of the view and into the model
  • at update / manually:
      • a) for new establishments (and address updates), a site scraper tries to determine the property_id
      • b) if the establishment has a property_id, check whether Durham County is serving an image or a 404, and populate image_url accordingly
      • c) if there is no property_id, image_url stays blank

Does this sound reasonable?
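
Steps a) to c) could look roughly like the sketch below. Everything named here is a stand-in for illustration: the `Establishment` dataclass, the `lookup_property_id` and `image_exists` callables, and the URL template are assumptions, not the project's actual model or Durham County's real URL scheme. The scraper and the 404 check are passed in as functions so the decision logic can be tried without hitting the network.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical template; the real Durham County image URL is not given in this thread.
IMAGE_URL_TEMPLATE = "https://example.invalid/photos/{property_id}.jpg"


@dataclass
class Establishment:
    # Stand-in for the real model; the field names are assumptions.
    address: str
    property_id: Optional[str] = None
    image_url: str = ""


def update_image_url(
    est: Establishment,
    lookup_property_id: Callable[[str], Optional[str]],
    image_exists: Callable[[str], bool],
) -> Establishment:
    # a) no property_id yet: let the site scraper try to resolve it from the address
    if est.property_id is None:
        est.property_id = lookup_property_id(est.address)
    # c) still no property_id: image_url stays blank
    if est.property_id is None:
        est.image_url = ""
        return est
    # b) property_id known: keep the URL only if the county serves an image (not a 404)
    url = IMAGE_URL_TEMPLATE.format(property_id=est.property_id)
    est.image_url = url if image_exists(url) else ""
    return est
```

An address change would need to reset `property_id` to `None` before calling this, so that step a) runs again; the sketch does not handle that on its own.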

tamcap avatar May 23 '14 02:05 tamcap

Sounds right. I'd suggest the scraping/photo_url check be its own standalone task in eatsmart.locations.durham that's run on a regular interval (like once a night). It can just skip over establishments with valid photo_urls.
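
A rough sketch of what the body of such a task might do, in plain Python. The dict-based records and the `resolve_photo_url` callable are hypothetical placeholders for the real model and the scrape-plus-404 check, and "valid" is simplified here to "non-empty", which is an assumption. Wrapping this in a celery task and scheduling it nightly is left out.

```python
from typing import Callable, Dict, Iterable, List


def refresh_photo_urls(
    establishments: Iterable[Dict[str, str]],
    resolve_photo_url: Callable[[Dict[str, str]], str],
) -> List[Dict[str, str]]:
    """Fill in photo_url for records that lack one; return the records that changed."""
    updated = []
    for est in establishments:
        if est.get("photo_url"):
            # Already has a photo_url: skip it (treated as valid in this sketch).
            continue
        url = resolve_photo_url(est)  # hypothetical scrape + 404 check
        if url:
            est["photo_url"] = url
            updated.append(est)
    return updated
```

Because existing URLs are skipped rather than re-checked, this pass would not notice images that have since disappeared; that would need a separate re-validation step.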

copelco avatar May 23 '14 02:05 copelco