plan topic-level image support

Open rahulbot opened this issue 4 years ago • 1 comment

Building on #602 and #593, we now want to figure out the requirements for allowing image-based analysis within topics. I reviewed all the old conversations I could find and collected these notes worth keeping in mind:

- people like the transfer-learning-based clustering approaches
- the same images are very often reused, but cropped differently in different stories
- the social-sharing image (og:image) is often different from all the others
- the idea of a "top image" is conceptually helpful for analysis
- the ResNet50 image-similarity work uses small (224px square) images
- people like the ability to see full-size images

With regard to the back-end pipeline, I'd translate those notes into requirements and next steps like this:

- extract all the story-related image URLs from each story in a topic
  - should we make this optional at the topic level? perhaps to save cost?
  - decide whether to use Newspaper3k or roll-our-own
  - make sure we only extract them once for each story
  - decide whether deduplication is worth solving or not
- within a story, mark the "top image" and "social sharing image(s)"
  - create a DB table structure that allows for this
- store full size images and 224px size images by default
  - @pypt suggests an S3 store for this, re-using a solution we use for other things
  - do some tests to estimate ongoing cost and growth rate
- specify API endpoints for retrieval of said images
  - my first thought is to just add an images property to any topic story list results (that'd let us render image tree maps quickly)
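
To make the extraction and marking steps above a bit more concrete, here is a minimal sketch of the roll-our-own option using only the Python standard library. Everything in it is an assumption for illustration: the `StoryImage` record, the `extract_story_images` function, and the heuristic of treating the first inline `<img>` as the "top image" are hypothetical, not existing Media Cloud code. Deduplication is by exact resolved URL only, so it would not catch the differently cropped copies mentioned in the notes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class StoryImage:
    """Hypothetical record; roughly what a row in a story-images table might hold."""
    url: str
    is_top_image: bool = False
    is_social_image: bool = False


class _ImageCollector(HTMLParser):
    """Collects og:image meta tags and inline <img> sources in document order."""

    def __init__(self):
        super().__init__()
        self.social_urls = []
        self.inline_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image" and attrs.get("content"):
            self.social_urls.append(attrs["content"])
        elif tag == "img" and attrs.get("src"):
            self.inline_urls.append(attrs["src"])


def extract_story_images(html, story_url):
    """Return one StoryImage per distinct absolute URL found in the story HTML."""
    collector = _ImageCollector()
    collector.feed(html)

    images = {}  # absolute URL -> StoryImage, so repeated URLs collapse
    for raw in collector.social_urls:
        url = urljoin(story_url, raw)
        images.setdefault(url, StoryImage(url)).is_social_image = True
    for raw in collector.inline_urls:
        url = urljoin(story_url, raw)
        images.setdefault(url, StoryImage(url))

    # Assumed heuristic: the first inline image is the "top image";
    # fall back to the first social image if the body has none.
    candidates = collector.inline_urls or collector.social_urls
    if candidates:
        images[urljoin(story_url, candidates[0])].is_top_image = True
    return list(images.values())
```

Keeping the top/social flags on the image record itself means one table keyed by story and URL could cover both markings, and the same records could be serialized as the `images` property on story list results.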

A separate task is to design an approach to automatically train an image-embeddings model based on the ResNet50 transfer learning approach we learned from Leon (for each snapshot). I think that still needs investigation and research work, particularly on which similarity algorithm to use and on what to present to users to support research. Sometimes they say they want his "mosaics", but other times it seems they want clusters.
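
As a starting point for that investigation, here is a rough sketch of one possible way to turn embeddings into clusters: cosine similarity with a greedy threshold. This is not Leon's approach and not a recommendation; the choice of similarity algorithm is still open. The function names and the 0.9 threshold are assumptions, and the toy 3-dimensional vectors stand in for real ResNet50 feature vectors, which are not computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar_images(embeddings, threshold=0.9):
    """Greedy grouping: each image joins the first group whose first member
    is at least `threshold` similar to it, otherwise it starts a new group.

    `embeddings` maps an image id to its (assumed precomputed) feature vector.
    """
    groups = []
    for image_id, vector in embeddings.items():
        for group in groups:
            if cosine_similarity(embeddings[group[0]], vector) >= threshold:
                group.append(image_id)
                break
        else:
            groups.append([image_id])
    return groups


# Toy vectors only; real embeddings would come from the trained model.
toy_embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
toy_groups = group_similar_images(toy_embeddings)
```

The result depends on iteration order and on the threshold, which is one reason a proper clustering algorithm might be preferable once we know whether users want clusters or mosaics.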

What did I miss? Thoughts on these requirements?

Jan 28 '20 14:01 rahulbot

More notes on the related project board: https://github.com/berkmancenter/mediacloud/projects/3

Feb 18 '20 13:02 rahulbot
