
Phi3Vision performs well on training/eval dataset but not well in generation/inference post training

Open hm-ca opened this issue 6 months ago • 4 comments

Hi Team,

I trained Phi3-Vision on my short video captioning dataset. The training went very well (after multiple attempts), and the model reached a good performance point (i.e., low train and validation loss, as well as a high BERT similarity score on the validation dataset when comparing the generated captions to the ground-truth captions). Please see the screenshot below.

However, post training, when I use the trained model checkpoints to generate captions for videos, the captions are mostly incorrect, even for samples from the training and validation datasets that scored very high during training eval!

For inference, I made sure to use the same data prep techniques I used for training (minus teacher forcing), and I use the generate function with greedy decoding, nucleus sampling, etc., but I don't seem to get consistently good results (~15% match compared to ~82% during training).
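
To make the teacher-forcing distinction concrete, here is a toy sketch of how a score computed with teacher forcing can differ from one computed with free-running greedy decoding. This is not the actual Phi-3-Vision training or inference code: the lookup-table "model", the token ids, and all function names are hypothetical, and it only illustrates one possible source of such a gap, not a diagnosis.

```python
# Toy illustration only: a hypothetical lookup-table "model", not Phi-3-Vision.
START = 0
TARGET = [1, 2, 3, 4, 5, 6]  # ground-truth token ids (made up)

# Maps previous token -> predicted next token.
# The model is wrong exactly once: after 2 it predicts 9 instead of 3.
NEXT_TOKEN = {0: 1, 1: 2, 2: 9, 3: 4, 4: 5, 5: 6, 9: 9}


def teacher_forced_accuracy(target):
    """Predict each token from the ground-truth prefix (training-style eval)."""
    prev_tokens = [START] + target[:-1]
    preds = [NEXT_TOKEN[p] for p in prev_tokens]
    return sum(p == t for p, t in zip(preds, target)) / len(target)


def greedy_generate(length):
    """Predict each token from the model's own previous output (generation-style)."""
    out, prev = [], START
    for _ in range(length):
        prev = NEXT_TOKEN[prev]
        out.append(prev)
    return out


def match_rate(generated, target):
    """Fraction of positions where the generated token equals the target token."""
    return sum(g == t for g, t in zip(generated, target)) / len(target)


tf_acc = teacher_forced_accuracy(TARGET)
generated = greedy_generate(len(TARGET))
gen_acc = match_rate(generated, TARGET)

print(f"teacher-forced accuracy: {tf_acc:.2f}")   # 0.83
print(f"free-running match rate: {gen_acc:.2f}")  # 0.33
```

In this toy case a single wrong prediction costs one position under teacher forcing, but under free-running decoding every later token is conditioned on the mistake, so the rest of the sequence goes wrong as well.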

What am I missing? Any help is really appreciated!

Screenshot 2024-08-12 at 1 02 29 PM

hm-ca • Aug 12 '24 17:08