How to evaluate the effect of PPO training in coati chat
I ran the third step (PPO training); it was time-consuming and unstable. The reward observed during training is between -300 and -10, as shown below. Is this normal? What does a good PPO training run look like? Is there a log I can check?

Episode [1/11]: 100%|██████████| 200/200 [1:41:33<00:00, 30.47s/it]
Episode [2/11]: 100%|██████████| 200/200 [1:45:15<00:00, 31.58s/it]
Episode [3/11]: 100%|██████████| 200/200 [1:45:56<00:00, 31.78s/it]
Episode [4/11]: 100%|██████████| 200/200 [1:45:38<00:00, 31.69s/it]
Train epoch [1/2]: 100%|██████████| 1000/1000 [1:23:58<00:00, 5.04s/it, reward=-7.67]
Train epoch [2/2]: 100%|██████████| 1000/1000 [1:23:51<00:00, 5.03s/it, reward=-7.74]
Episode [5/11]: 100%|██████████| 200/200 [4:33:10<00:00, 81.95s/it]
Episode [6/11]: 100%|██████████| 200/200 [1:44:04<00:00, 31.22s/it]
Episode [7/11]: 100%|██████████| 200/200 [1:44:18<00:00, 31.29s/it]
Episode [8/11]: 100%|██████████| 200/200 [1:44:10<00:00, 31.25s/it]
Episode [9/11]: 100%|██████████| 200/200 [1:44:24<00:00, 31.32s/it]
Episode [10/11]: 100%|█████████▉| 199/200 [1:41:53<00:30, 30.14s/it]
Train epoch [1/2]: 26%|██▌ | 261/1000 [21:50<1:02:27, 5.07s/it, reward=-188]
Hi @guijuzhejiang, since stage 3 of RLHF uses reinforcement learning (here, the PPO algorithm), the time consumption and instability may be caused by the dataset size and the dynamic nature of the training process. Debugging RL algorithms is not easy, and we are still validating the PPO training ourselves. You may test with simple environments, or visualize some statistics (such as the running mean, std, min, and max of episode returns, the KL divergence of each policy update, etc.) to check whether training is behaving correctly.
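A minimal, framework-agnostic sketch of the bookkeeping suggested above (the names RunningStats and approx_kl are my own, not part of coati): it tracks the running mean/std/min/max of episode returns with Welford's algorithm and estimates the KL of a policy update from the old vs. new log-probs of the same sampled tokens.

import math

class RunningStats:
    """Running mean/std/min/max of a scalar stream (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.min = math.inf
        self.max = -math.inf

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def approx_kl(old_logprobs, new_logprobs) -> float:
    """Estimate KL(old || new) from per-token log-probs of the same sampled
    actions, using the low-variance (exp(lr) - 1 - lr) estimator."""
    kl = 0.0
    for lp_old, lp_new in zip(old_logprobs, new_logprobs):
        log_ratio = lp_new - lp_old
        kl += (math.exp(log_ratio) - 1.0) - log_ratio
    return kl / max(len(old_logprobs), 1)


# Example: update reward_stats with each episode return, then print
# reward_stats.mean / .std / .min / .max alongside approx_kl per update step.
reward_stats = RunningStats()
for episode_return in [-250.0, -120.0, -30.0]:
    reward_stats.update(episode_return)
print(reward_stats.mean, reward_stats.std, reward_stats.min, reward_stats.max)

Plotting these curves over training steps (e.g. with TensorBoard or matplotlib) makes it easier to see whether returns trend upward and whether the KL per update stays small and stable.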
@Camille7777 Hi. Are there options or callbacks with which I can check reward_mean, KL, or other metrics during training?
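To make the question concrete, something like the sketch below is the kind of hook I mean; the class name, the on_learn_batch_end hook, and the metric keys are placeholders for illustration, not coati's actual callback API.

class MetricsLoggerCallback:
    """Hypothetical callback that appends per-update metrics to a CSV file."""
    def __init__(self, log_path: str = "ppo_metrics.csv"):
        self.log_path = log_path
        self.step = 0
        with open(self.log_path, "w") as f:
            f.write("step,reward_mean,approx_kl\n")

    def on_learn_batch_end(self, metrics: dict) -> None:
        # Assumed to be called after each PPO update with a dict of metrics;
        # key names would need to match whatever the trainer actually reports.
        self.step += 1
        reward_mean = metrics.get("reward_mean", float("nan"))
        kl = metrics.get("approx_kl", float("nan"))
        with open(self.log_path, "a") as f:
            f.write(f"{self.step},{reward_mean},{kl}\n")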