mayank goyal
Any plans to support this on Text Generation Inference?
The above were training logs; the eval logs look the same, and rewards/chosen is decreasing. "log_history": [ { "epoch": 0.09, "eval_logits/chosen": -0.8852691054344177, "eval_logits/rejected": -0.8777562379837036, "eval_logps/chosen": -28.69784927368164, "eval_logps/rejected": -106.26754760742188, "eval_loss": 0.2819069027900696, "eval_rewards/accuracies":...
Information about the model and training:
Task: question answering from the context of documents.
Architecture: Llama-2-7B-Chat.
Finetuning: standard LoRA finetuning with the DPO loss, beta = 0.1.
Yes, SFT was done on...
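For reference, here is a minimal sketch of the DPO objective as I understand it (function name and tensor shapes are my own; I'm assuming per-sequence summed log-probs from the policy and the frozen reference model are already computed). It shows why rewards/chosen can decrease while the loss still falls: the loss only depends on the reward *margin*, so both implicit rewards can drift negative as long as rewards/rejected drops faster.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO loss on summed per-sequence log-probs."""
    # Implicit rewards: beta * log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # The loss depends only on the margin (chosen - rejected), not on
    # either reward's absolute value.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean(), chosen_rewards, rejected_rewards

# Example with numbers shaped like the logs above: chosen logps drop
# slightly vs. the reference, rejected logps drop much more.
loss, cr, rr = dpo_loss(
    policy_chosen_logps=torch.tensor([-30.0]),
    policy_rejected_logps=torch.tensor([-110.0]),
    ref_chosen_logps=torch.tensor([-28.0]),
    ref_rejected_logps=torch.tensor([-100.0]),
)
# cr is negative (rewards/chosen decreasing), yet loss is well below
# log(2) ~ 0.693 because the margin is positive.
```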
@eric-mitchell, let me know if my understanding is correct.