
Make a machine learning model to predict/extrapolate the scores

Open glerzing opened this issue 1 year ago • 0 comments

It would be nice to be able to guess the Tournesol score of videos that have not been compared yet, or to extrapolate the "final" score when there aren't many comparisons yet. The latter is harder to do if we want to be careful about biases, and may benefit from the insights of issue #1474.

We first need to decide which data to use for predictions. It is important to be careful about biases and to avoid relying on superficial signals. The main source of information should therefore be the actual content of the video, i.e. the captions. But we should also make use of the title, tags, topic category, description, and arguably the channel.

More controversial sources of information include the release date, the number of views, the number of likes, the number of subscribers of the channel, the number of comments or combinations of these (e.g. the ratio of likes per view, or the ratio of comments per view). For these, we may want to decide on a case-by-case basis.
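As a rough illustration of the combined features mentioned above, here is a minimal sketch that computes the two ratios from raw counts. The function name `engagement_ratios` and the dictionary keys are made up for this example, and returning `None` for videos without views is just one possible convention.

```python
from typing import Dict, Optional


def engagement_ratios(views: int, likes: int, comments: int) -> Dict[str, Optional[float]]:
    """Compute likes-per-view and comments-per-view for one video.

    Hypothetical helper: the name and the output keys are illustrative only.
    """
    if views <= 0:
        # No views: the ratios are undefined, so report them as missing
        # rather than dividing by zero.
        return {"likes_per_view": None, "comments_per_view": None}
    return {
        "likes_per_view": likes / views,
        "comments_per_view": comments / views,
    }
```

Keeping each ratio as a separate, optional feature would make it easy to include or exclude it case by case.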

For the model type, we will probably need to combine the results of several weak predictors. Sentence transformers could be fine-tuned to predict the score from a chunk of the captions, and maybe provide some measure of uncertainty. To combine the per-chunk predictions, we might use some type of weighted mean.
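One possible form of that weighted mean is inverse-variance weighting, sketched below. This is only an assumption about how the combination could work: it supposes each chunk prediction comes with a standard deviation and treats chunk errors as independent, which is unlikely to hold exactly for chunks of the same video. `ChunkPrediction` and `combine_chunk_predictions` are hypothetical names, and the fine-tuned model itself is not shown.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ChunkPrediction:
    """Predicted score for one caption chunk, with its uncertainty (hypothetical type)."""
    score: float
    std: float  # predicted standard deviation; smaller means more confident


def combine_chunk_predictions(
    preds: Sequence[ChunkPrediction], min_std: float = 1e-6
) -> Tuple[float, float]:
    """Combine per-chunk predictions with an inverse-variance weighted mean.

    Returns the combined score and the standard deviation of the combined
    estimate, assuming independent chunk errors (a simplification).
    """
    if not preds:
        raise ValueError("at least one chunk prediction is required")
    # Weight each chunk by 1 / variance; clamp std to avoid division by zero.
    weights = [1.0 / max(p.std, min_std) ** 2 for p in preds]
    total = sum(weights)
    mean = sum(w * p.score for w, p in zip(weights, preds)) / total
    combined_std = (1.0 / total) ** 0.5
    return mean, combined_std
```

For instance, combining a chunk with score 0 and std 1 with a chunk with score 30 and std 2 gives a combined score of 6.0, since the more confident chunk gets four times the weight.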

glerzing, Mar 28 '23 01:03