DeepSpeedExamples
Why is the PPL so high at the beginning of Step 1 (SFT)?
https://github.com/microsoft/DeepSpeedExamples/blob/737c6740bec38b77a24a59135b6481a53d566b38/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_log_output/opt-1.3b-globalBatchSize128.log#L4
Why is the PPL here ~4k when we are starting from a pretrained model?
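For context on the magnitude: perplexity is the exponential of the mean per-token cross-entropy loss, so even a modestly elevated loss produces an enormous PPL. A minimal sketch (the loss values below are illustrative, not taken from the linked log):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean token-level negative log-likelihood (in nats)."""
    return math.exp(mean_nll)

# A well-trained LM might sit around loss ~2 nats -> PPL ~7.4
print(perplexity(2.0))

# A loss of ~8.3 nats already maps to PPL ~4000, so a "4k" PPL
# corresponds to only a few extra nats of loss at the first step.
print(perplexity(8.3))
```

This is why an initial PPL spike looks dramatic even when the underlying loss gap is small in absolute terms.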