food-inspector
Add scraper celery task to update Durham images
Not sure how to approach this one... I used a site scraper to reverse the address into a property_id. I don't mind sharing that code so we could incorporate it.
However:
- Running it for already-processed establishments is pointless: their addresses usually don't change, so the property_id stays the same and the image won't change. We use the most recent image that Durham County has for a given property_id (we actually hotlink to their site...)
- We could run it on import for new establishments / ones that have moved, in an async fashion (e.g. through Celery).
I think one of the problems we have is that some of the images are no longer found. If we scraped the images on a regular basis, every week or every other week, we could update which properties have an image and which don't, and maybe even get images for establishments that did not have one previously.
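To make the "image or 404" idea concrete, here is a minimal sketch of the check. The URL pattern is an assumption (I don't know the county's actual path scheme), and the HTTP check is injected as a callable so it can be stubbed in tests or backed by something like `requests.head` in production:

```python
# Hypothetical sketch: given a property_id, decide whether Durham County
# serves an image for it. The URL pattern below is an assumption, not
# the project's actual code.

def image_url_for(property_id, url_exists):
    """Return the hotlinked image URL, or None if the county serves a 404.

    `url_exists` is a callable (url -> bool) so the HTTP check can be
    stubbed out in tests or swapped for a real HEAD request.
    """
    if not property_id:
        return None
    url = "http://maps.durhamnc.gov/photos/%s.jpg" % property_id  # assumed pattern
    return url if url_exists(url) else None
```

Because the check is injected, the weekly re-scrape and the on-import path can share this one function.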
OK, to summarize this and #112:
- we need to separate the picture url logic from the view into the model
- at update / manually:
  - a) for new establishments (and address updates), a site scraper tries to determine the property_id
  - b) if the establishment has a property_id, check whether Durham County is serving an image or a 404, and populate image_url accordingly
  - c) if there is no property_id, image_url stays blank
Does this sound reasonable?
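The a/b/c flow above can be sketched as a single model-level update function. Plain dicts stand in for the Establishment model here, and both the scraper and the image check are injected callables; all names and the URL pattern are hypothetical:

```python
# Sketch of the summarized update flow (a-c). `establishment` is a dict
# standing in for the model instance; `scrape_property_id` and
# `image_exists` are injected so the sketch stays dependency-free.

def update_image_url(establishment, scrape_property_id, image_exists):
    """Apply steps a-c to one establishment, mutating it in place."""
    # a) new establishment / address update: try to determine property_id
    if not establishment.get("property_id"):
        establishment["property_id"] = scrape_property_id(establishment["address"])
    pid = establishment.get("property_id")
    if pid:
        # b) property_id known: check whether the county serves an image or a 404
        url = "http://maps.durhamnc.gov/photos/%s.jpg" % pid  # assumed pattern
        establishment["image_url"] = url if image_exists(url) else ""
    else:
        # c) no property_id: image_url stays blank
        establishment["image_url"] = ""
    return establishment
```

Keeping this on the model (rather than in the view) is exactly the separation the first bullet asks for.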
Sounds right. I'd suggest the scraping/photo_url check be its own standalone task in eatsmart.locations.durham
that's run on a regular interval (like once a night). It can just skip over establishments with valid photo_urls.
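A sketch of that task body, with the skip-over behavior. The Celery decoration (e.g. a beat-scheduled periodic task) is omitted so the sketch stays dependency-free, and the data shapes, names, and URL pattern are assumptions:

```python
# Hypothetical nightly task body: re-check photo_urls, skipping
# establishments whose existing photo_url still resolves. Dicts stand in
# for model instances; `image_exists` is an injected url -> bool check.

def refresh_photo_urls(establishments, image_exists):
    """Return the list of establishments whose photo_url was (re)written."""
    touched = []
    for est in establishments:
        if est.get("photo_url") and image_exists(est["photo_url"]):
            continue  # already has a valid photo; skip
        pid = est.get("property_id")
        if not pid:
            continue  # nothing to look up without a property_id
        url = "http://maps.durhamnc.gov/photos/%s.jpg" % pid  # assumed pattern
        est["photo_url"] = url if image_exists(url) else ""
        touched.append(est)
    return touched
```

Wired up as a Celery beat entry running nightly, this would also pick up images for establishments that previously had none, per the earlier comment.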