Hugging-Captions
Rank the captions
From the README:
"Some of the generated captions are going to be ugly. Some of the generated captions are going to be really good but a word or two simply does not make sense. This is expected no matter how much the data, both training and generated, is cleaned."
Why don't you simply use something such as lm-scorer to sort the captions according to their score? That way, nonsense captions will be pushed to the bottom of /Hugging-Captions/text/generated_text/<tag>_gen.txt.
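A minimal sketch of what that sorting could look like, in case it helps. The helper names `rank_captions` and `make_lm_scorer_fn` are hypothetical and not part of Hugging-Captions or lm-scorer. The lm-scorer call follows the usage pattern from its README with the stock "gpt2" model and assumes the package is installed; a fine-tuned model is not covered here.

```python
from typing import Callable, Iterable, List, Tuple


def rank_captions(
    captions: Iterable[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Return (caption, score) pairs sorted from highest to lowest score.

    Blank captions are skipped. `score_fn` is any callable that maps a
    caption to a number where higher means "more plausible".
    """
    scored = [(c, score_fn(c)) for c in captions if c.strip()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def make_lm_scorer_fn(model_name: str = "gpt2") -> Callable[[str], float]:
    """Build a scoring function backed by lm-scorer (hypothetical helper).

    The import is done lazily so that rank_captions can be used and
    tested without lm-scorer installed. Assumes `model_name` is one of
    the stock model names that AutoLMScorer supports.
    """
    from lm_scorer.models.auto import AutoLMScorer as LMScorer

    scorer = LMScorer.from_pretrained(model_name, device="cpu", batch_size=1)
    return lambda text: scorer.sentence_score(text, reduce="mean", log=True)
```

With lm-scorer installed, `rank_captions(lines, make_lm_scorer_fn())` would return the generated captions with the highest-scoring ones first; writing them back to the generated text file is left out on purpose.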
@ruanchaves lm-scorer seems like it would indeed be a good solution to this problem. It is not going to change the fact that some of the captions generated will be ugly but it may help in bubbling up the better captions to the top. I'll test it out. Thanks.
Just be careful to use the following import if you're using a fine-tuned model:
from lm_scorer.models.auto import GPT2LMScorer as LMScorer
(This is not documented and took me a few minutes to notice.)
Sounds good. I am going to play around with this some time tomorrow.
@ruanchaves I have tested lm-scorer out and I do not think it is worth implementing. It does not seem that a good score necessarily correlates with a quality caption. Captions are of a subjective nature and I do not think that a probabilistic method that scores the probability of a group of tokens/sequence of tokens can easily capture this.
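To illustrate the point about probabilistic scores, here is a small worked example with made-up per-token log-probabilities. The numbers and the function names `total_log_prob` and `mean_log_prob` are hypothetical, and the code does not call lm-scorer. It only shows that the ranking depends on how token probabilities are combined, and that neither way of combining them looks at whether a caption is actually good.

```python
from typing import Sequence


def total_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of the whole sequence (sum over tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return sum(token_log_probs)


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Average log-probability per token (length-normalized)."""
    return total_log_prob(token_log_probs) / len(token_log_probs)


# Hypothetical values: a short caption with unlikely tokens and a
# longer caption whose individual tokens are each more likely.
short_caption = [-2.0, -2.0]
long_caption = [-1.0, -1.0, -1.0, -1.0, -1.0, -1.0]

# Summing favors the short caption (-4.0 vs -6.0) ...
print(total_log_prob(short_caption), total_log_prob(long_caption))
# ... while averaging favors the long one (-2.0 vs -1.0).
print(mean_log_prob(short_caption), mean_log_prob(long_caption))
```

Either way, the score reflects how predictable the tokens are to the language model, which is consistent with the observation that a high score does not guarantee a caption a person would prefer.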
Did you try scorer.sentence_score("I like this package.", log=True)? For some reason, some options work better than others (maybe a bug). Otherwise, you're probably right, maybe it's just not applicable.
Yes that is exactly what I used. Appreciate the suggestion.