plan topic-level image support

Open • rahulbot opened this issue 4 years ago • 1 comment

Building on #602 and #593, we now want to figure out the requirements for allowing image-based analysis within topics. I reviewed all the old conversations I could find and collected these notes worth keeping in mind:

  • people like the transfer-learning-based clustering approaches
  • the same images are very often used, but cropped differently in different stories
  • the social-sharing image (og:image) is often different from all the others
  • the idea of "top-image" is conceptually helpful for analysis
  • the ResNet50 image-similarity work uses small (224px square) images
  • people like the ability to see full size images

With regard to the back-end pipeline, I'd translate those notes into requirements and next steps like this:

  1. extract all the story-related image URLs from each story in a topic
    • should we make this optional at the topic level? perhaps to save cost?
    • decide whether to use Newspaper3k or roll our own extractor
    • make sure we only extract them once for each story
    • decide whether deduplication is worth solving or not
  2. within a story, mark the "top image" and "social sharing image(s)"
    • create a DB table structure that allows for this
  3. store full size images and 224px size images by default
    • @pypt suggests an S3 store for this, re-using a solution we use for other things
    • do some tests to estimate ongoing cost and growth rate
  4. specify API endpoints for retrieval of said images
    • my first thought is to just add an images property to any topic story list results (that'd let us render image tree maps quickly)
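
To make steps 1, 2 and 4 a bit more concrete, here is a minimal sketch of the roll-our-own option using only the Python standard library. It is not existing Media Cloud code: `StoryImageParser`, `extract_story_images`, the `roles` field and the "first `<img>` is the top image" heuristic are all assumptions for illustration. It collects `og:image` and `<img src>` URLs from one story's HTML, collapses exact-URL duplicates within the story, and returns a list shaped like what an `images` property on a story result might look like.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StoryImageParser(HTMLParser):
    """Collect og:image and <img src> URLs from one story's HTML (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []        # entries in first-seen order
        self._by_url = {}       # url -> entry, so each URL is stored once per story
        self._top_marked = False

    def _add(self, url, role=None):
        url = urljoin(self.base_url, url)  # resolve relative URLs against the story URL
        entry = self._by_url.get(url)
        if entry is None:
            entry = {"url": url, "roles": set()}
            self._by_url[url] = entry
            self.images.append(entry)
        if role:
            entry["roles"].add(role)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image" and attrs.get("content"):
            self._add(attrs["content"], "social")
        elif tag == "img" and attrs.get("src"):
            # Assumption: treat the first <img> in the document as the "top image".
            role = None if self._top_marked else "top"
            self._top_marked = True
            self._add(attrs["src"], role)


def extract_story_images(html, base_url):
    """Return [{"url": ..., "roles": [...]}, ...] for a single story."""
    parser = StoryImageParser(base_url)
    parser.feed(html)
    parser.close()
    return [{"url": e["url"], "roles": sorted(e["roles"])} for e in parser.images]
```

This only deduplicates identical URLs within one story. Differently cropped copies of the same picture would still need a content-based comparison, which is part of the open deduplication question above.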

A separate task is to design an approach for automatically training an image-embeddings model for each snapshot, based on the ResNet50 transfer-learning approach we learned from Leon. I think that still needs investigation and research, particularly on which similarity algorithm to use and on what to present to users to support their research. Sometimes they say they want his "mosaics", but other times it seems they want clusters.
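
As one candidate to evaluate (not a decision), here is a minimal sketch of nearest-neighbour lookup by cosine similarity. It assumes each image already has an embedding vector, for example the pooled features from a ResNet50 model, stored in a plain dict keyed by image ID. The names `cosine_similarity` and `most_similar` are hypothetical, and cosine similarity is just one of the algorithms we would want to compare.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_n=5):
    """Rank every other image by similarity to the query image, most similar first."""
    query = embeddings[query_id]
    scored = [
        (image_id, cosine_similarity(query, vector))
        for image_id, vector in embeddings.items()
        if image_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A ranked list like this would feed a "similar images" view. Clusters or mosaics would need a further grouping step on top of the same pairwise scores.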

What did I miss? Thoughts on these requirements?

rahulbot • Jan 28 '20 14:01

More notes on the related project board: https://github.com/berkmancenter/mediacloud/projects/3

rahulbot • Feb 18 '20 13:02