
How to limit the generated text token to a maximum of 77?

Open song-wensong opened this issue 1 year ago • 1 comments

When I show Video-LLaVA a short video, I use the prompt inp = 'Could you please provide a detailed description for this video? Your comprehensive video caption should allow listeners to visualize the scene without actually watching the video. Note that the generated text tokens should not exceed 77!' However, the generated text is always longer than 77 tokens. How should I change inp or adjust the model so that the output meets this requirement? (I want to feed the generated text to CLIP afterwards, so it has to fit within 77 tokens.)

song-wensong avatar Feb 23 '24 13:02 song-wensong

Sorry, this is a known issue. The model may not follow instructions like this well because it was fine-tuned on too little video data.

LinB203 avatar Feb 26 '24 05:02 LinB203
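
Since the prompt alone may not enforce the limit, one possible workaround is to trim the caption after generation instead. The sketch below is not from the Video-LLaVA repo; it is a minimal illustration that keeps whole sentences while they fit a token budget. The helper names (`truncate_to_budget`, `whitespace_token_count`) are hypothetical, and the whitespace counter is only a stand-in: CLIP uses a BPE tokenizer that usually yields more tokens than words, so in practice you would pass a counter backed by the real CLIP tokenizer. The budget of 75 assumes CLIP's 77-token context includes the start and end tokens.

```python
import re
from typing import Callable

CLIP_CONTEXT_LENGTH = 77
# Assumption: two of the 77 positions are taken by the start/end tokens.
CONTENT_BUDGET = CLIP_CONTEXT_LENGTH - 2


def whitespace_token_count(text: str) -> int:
    """Stand-in counter (hypothetical): counts whitespace-separated words.

    This is NOT CLIP's tokenizer; swap in a real CLIP token counter
    before relying on the result.
    """
    return len(text.split())


def truncate_to_budget(
    text: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    budget: int = CONTENT_BUDGET,
) -> str:
    """Keep leading whole sentences while the running text fits the budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if count_tokens(candidate) > budget:
            break
        kept.append(sentence)
    if kept:
        return " ".join(kept)

    # Even the first sentence is too long: drop trailing words until it fits.
    words = text.split()
    while words and count_tokens(" ".join(words)) > budget:
        words.pop()
    return " ".join(words)


if __name__ == "__main__":
    caption = "A dog runs along the beach. " * 20
    short = truncate_to_budget(caption)
    print(whitespace_token_count(short), "tokens (by the stand-in counter)")
```

Cutting at sentence boundaries keeps the caption readable, at the cost of sometimes using less than the full budget. Capping the generation length on the model side is another option, but that cap is counted in the language model's own tokens, which do not map one-to-one onto CLIP tokens.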