backend
plan topic-level image support
Building on #602 and #593, we now want to figure out the requirements for allowing image-based analysis within topics. I reviewed all the old conversations I could find and collected these notes worth keeping in mind:
- people like the transfer-learning-based clustering approaches
- the same images are very often used, but cropped differently in different stories
- the social-sharing image (`og:image`) is often different from all the others - the idea of "top-image" is conceptually helpful for analysis
- the ResNet50 image-similarity work uses small (224px square) images
- people like the ability to see full size images
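For reference, the distinction between the social-sharing image and the other images in a story can be sketched with nothing but the Python standard library. This is only an illustration, not a proposal over Newspaper3k: the class and function names (`ImageURLParser`, `extract_image_urls`) are made up for this example, and it ignores relative URLs, lazy-loaded images, and `srcset`.

```python
from html.parser import HTMLParser


class ImageURLParser(HTMLParser):
    """Collect og:image URLs and <img src> URLs from one story's HTML.

    Hypothetical helper for illustration; not existing pipeline code.
    """

    def __init__(self):
        super().__init__()
        self.social_images = []  # from <meta property="og:image" content="...">
        self.inline_images = []  # from <img src="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            if attrs.get("content"):
                self.social_images.append(attrs["content"])
        elif tag == "img" and attrs.get("src"):
            self.inline_images.append(attrs["src"])


def extract_image_urls(html):
    """Return the social-sharing and inline image URLs found in the HTML."""
    parser = ImageURLParser()
    parser.feed(html)
    return {"social": parser.social_images, "inline": parser.inline_images}
```

Keeping the two lists separate is what would let us mark the "social sharing image(s)" independently of whatever we decide counts as the "top image".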
With regard to the back-end pipeline, I'd translate those notes into requirements and next steps like this:
- extract all the story-related image URLs from each story in a topic
  - should we make this optional at the topic level? perhaps to save cost?
  - decide whether to use Newspaper3k or roll-our-own
  - make sure we only extract them once for each story
  - decide whether deduplication is worth solving or not
- within a story, mark the "top image" and "social sharing image(s)"
  - create a DB table structure that allows for this
- store full size images and 224px size images by default
  - @pypt suggests an S3 store for this, re-using a solution we use for other things
  - do some tests to estimate ongoing cost and growth rate
- specify API endpoints for retrieval of said images
  - my first thought is to just add an `images` property to any topic story list results (that'd let us render image tree maps quickly)
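To make the table and API ideas above a bit more concrete, here is a rough sketch. Everything in it is an assumption for discussion: the table name `story_images`, its columns, and the shape of the `images` property are hypothetical, and SQLite is used only so the example runs standalone. The unique constraint is one way to make sure a repeated extraction pass does not store the same image twice for a story.

```python
import sqlite3

# Hypothetical table: one row per (story, image URL), with flags for the
# "top image" and "social sharing image" roles.
SCHEMA = """
CREATE TABLE story_images (
    stories_id      INTEGER NOT NULL,
    url             TEXT    NOT NULL,
    is_top_image    INTEGER NOT NULL DEFAULT 0,
    is_social_image INTEGER NOT NULL DEFAULT 0,
    UNIQUE (stories_id, url)
)
"""


def store_story_images(conn, stories_id, images):
    """Insert image rows for a story; rows that already exist are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO story_images VALUES (?, ?, ?, ?)",
        [
            (stories_id, img["url"], int(img.get("top", False)), int(img.get("social", False)))
            for img in images
        ],
    )


def images_for_story(conn, stories_id):
    """Build the list that would go into a story's `images` property."""
    rows = conn.execute(
        "SELECT url, is_top_image, is_social_image FROM story_images "
        "WHERE stories_id = ? ORDER BY rowid",
        (stories_id,),
    )
    return [
        {"url": url, "is_top_image": bool(top), "is_social_image": bool(social)}
        for url, top, social in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_story_images(conn, 1, [
    {"url": "https://example.com/share.jpg", "social": True},
    {"url": "https://example.com/lead.jpg", "top": True},
])
# A second extraction pass for the same story adds nothing new.
store_story_images(conn, 1, [{"url": "https://example.com/lead.jpg", "top": True}])

# What one entry in a topic story list result might look like.
story = {"stories_id": 1, "title": "Example story", "images": images_for_story(conn, 1)}
```

Where the full-size and 224px copies live (for example, keys into the S3 store) would need extra columns that this sketch leaves out.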
A separate task is to design an approach for automatically training an image-embeddings model (for each snapshot) based on the ResNet50 transfer-learning approach we learned from Leon. I think that still needs investigation and research, particularly on which similarity algorithm to use and on what to present to users to support their research. Sometimes they say they want his "mosaics", but other times it seems they want clusters.
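As a starting point for that research, here is a sketch of one candidate similarity measure, cosine similarity, applied to embedding vectors. The choice of cosine is an assumption for illustration, not a decision; the function names are hypothetical, and real embeddings would come from the ResNet50-based model rather than being plain Python lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Return (image_id, score) pairs sorted from most to least similar to `query`.

    `embeddings` maps a hypothetical image id to its embedding vector.
    """
    scores = [
        (image_id, cosine_similarity(query, vector))
        for image_id, vector in embeddings.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A ranked list like this could feed either presentation: nearest neighbours for a "mosaic"-style view, or pairwise scores as input to a clustering step.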
What did I miss? Thoughts on these requirements?
More notes on the related project board: https://github.com/berkmancenter/mediacloud/projects/3