Phi-3CookBook
Phi-3-Vision performs well on the training/eval datasets but poorly in generation/inference after training
Hi Team,
I trained Phi-3-Vision on my short-video captioning dataset. Training went very well (after multiple attempts), and the model reached a good performance point: low training and validation loss, plus a high BERT similarity score on the validation dataset when comparing generated captions to the ground-truth captions (please see the screenshot below).
However, after training, when I use the trained model checkpoints to generate captions for videos, the captions are mostly incorrect, even for samples from the training and validation datasets that scored very high during the training-time evaluation!
For inference, I made sure to use the same data preparation I used for training (minus teacher forcing), and I call the generate function with greedy decoding, nucleus sampling, etc., but I don't get consistently good results (~15% match compared to ~82% during training).
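To make the two evaluation modes concrete, here is a minimal toy sketch of the comparison described above. It is not the actual Phi-3-Vision pipeline: the bigram lookup table `toy_model`, the caption in `target`, and all helper names are hypothetical and exist only for illustration. It shows how a score computed with teacher forcing (each step sees the ground-truth prefix) can be much higher than a score computed on free-running greedy generation (each step sees the model's own previous output), because a single early mistake is fed back into every later step.

```python
from typing import Dict, List

BOS = "<s>"  # hypothetical start-of-sequence marker for this toy example

def teacher_forced_accuracy(model: Dict[str, str], target: List[str]) -> float:
    """Score each step with the ground-truth previous token as input."""
    prev_tokens = [BOS] + target[:-1]
    hits = sum(model.get(p) == t for p, t in zip(prev_tokens, target))
    return hits / len(target)

def greedy_generate(model: Dict[str, str], max_len: int) -> List[str]:
    """Free-running decoding: each step consumes the model's own last output."""
    out: List[str] = []
    prev = BOS
    for _ in range(max_len):
        nxt = model.get(prev)
        if nxt is None:  # the toy model has no prediction for this token
            break
        out.append(nxt)
        prev = nxt
    return out

def positional_match(generated: List[str], target: List[str]) -> float:
    """Fraction of target positions where the generated token is identical."""
    hits = sum(g == t for g, t in zip(generated, target))
    return hits / len(target)

# Hypothetical ground-truth caption.
target = ["a", "man", "rides", "a", "bike", "down", "the", "street"]

# Toy "model": previous token -> predicted next token.
# It is wrong in exactly one context: after the second "a" it predicts "man".
toy_model = {
    BOS: "a", "a": "man", "man": "rides", "rides": "a",
    "bike": "down", "down": "the", "the": "street",
}

tf_acc = teacher_forced_accuracy(toy_model, target)
generated = greedy_generate(toy_model, max_len=len(target))
free_acc = positional_match(generated, target)

print(f"teacher-forced accuracy: {tf_acc:.3f}")
print(f"free-running match:      {free_acc:.3f}")
print("generated:", " ".join(generated))
```

In this toy setup the teacher-forced score only loses one step, while the free-running output falls into a loop after its first mistake. If the ~82% figure was computed under teacher forcing and the ~15% figure under `generate`, a gap of this kind is one possible (unconfirmed) explanation; it would not rule out other causes such as a mismatch in prompt format or checkpoint loading.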
What am I missing? Any help is really appreciated!