How to evaluate the effect of PPO training in coati chat
I ran the third step (PPO training); it was time-consuming and unstable. The reward observed during training is between -300 and -10, as shown below. Is this normal? What does a good PPO training run look like? Is there a log I can check?

Episode [1/11]: 100%|██████████| 200/200 [1:41:33<00:00, 30.47s/it]
Episode [2/11]: 100%|██████████| 200/200 [1:45:15<00:00, 31.58s/it]
Episode [3/11]: 100%|██████████| 200/200 [1:45:56<00:00, 31.78s/it]
Episode [4/11]: 100%|██████████| 200/200 [1:45:38<00:00, 31.69s/it]
Train epoch [1/2]: 100%|██████████| 1000/1000 [1:23:58<00:00, 5.04s/it, reward=-7.67]
Train epoch [2/2]: 100%|██████████| 1000/1000 [1:23:51<00:00, 5.03s/it, reward=-7.74]
Episode [5/11]: 100%|██████████| 200/200 [4:33:10<00:00, 81.95s/it]
Episode [6/11]: 100%|██████████| 200/200 [1:44:04<00:00, 31.22s/it]
Episode [7/11]: 100%|██████████| 200/200 [1:44:18<00:00, 31.29s/it]
Episode [8/11]: 100%|██████████| 200/200 [1:44:10<00:00, 31.25s/it]
Episode [9/11]: 100%|██████████| 200/200 [1:44:24<00:00, 31.32s/it]
Episode [10/11]: 100%|█████████▉| 199/200 [1:41:53<00:30, 30.14s/it]
Train epoch [1/2]: 26%|██▌ | 261/1000 [21:50<1:02:27, 5.07s/it, reward=-188]
Hi @guijuzhejiang, since stage 3 of RLHF uses reinforcement learning (here, the PPO algorithm), the time consumption and instability may be caused by the dataset size and the dynamic nature of the training process. Debugging RL algorithms is not easy, and we are still validating the PPO training ourselves. You may test with simple environments, or visualize some statistics (such as the running mean, std, min, and max of episode returns, the KL divergence of each policy update, etc.) to check whether training is behaving correctly.
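A minimal, framework-agnostic sketch of the bookkeeping suggested above (the names RunningStats and approx_kl are my own, not part of coati): it tracks the running mean/std/min/max of episode returns with Welford's algorithm and estimates the KL of a policy update from the old vs. new log-probs of the same sampled tokens.

import math

class RunningStats:
    """Running mean/std/min/max of a scalar stream (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.min = math.inf
        self.max = -math.inf

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self) -> float:
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def approx_kl(old_logprobs, new_logprobs) -> float:
    """Estimate KL(old || new) from per-token log-probs of the same sampled
    actions, using the low-variance (exp(lr) - 1 - lr) estimator."""
    kl = 0.0
    for lp_old, lp_new in zip(old_logprobs, new_logprobs):
        log_ratio = lp_new - lp_old
        kl += (math.exp(log_ratio) - 1.0) - log_ratio
    return kl / max(len(old_logprobs), 1)


# Example: update reward_stats with each episode return, then print
# reward_stats.mean / .std / .min / .max alongside approx_kl per update step.
reward_stats = RunningStats()
for episode_return in [-250.0, -120.0, -30.0]:
    reward_stats.update(episode_return)
print(reward_stats.mean, reward_stats.std, reward_stats.min, reward_stats.max)

Plotting these curves over training steps (e.g. with TensorBoard or matplotlib) makes it easier to see whether returns trend upward and whether the KL per update stays small and stable.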
@Camille7777 Hi. Are there options or callbacks with which I can check reward_mean, KL, or other metrics during training?
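To make the question concrete, something like the sketch below is the kind of hook I mean; the class name, the on_learn_batch_end hook, and the metric keys are placeholders for illustration, not coati's actual callback API.

class MetricsLoggerCallback:
    """Hypothetical callback that appends per-update metrics to a CSV file."""
    def __init__(self, log_path: str = "ppo_metrics.csv"):
        self.log_path = log_path
        self.step = 0
        with open(self.log_path, "w") as f:
            f.write("step,reward_mean,approx_kl\n")

    def on_learn_batch_end(self, metrics: dict) -> None:
        # Assumed to be called after each PPO update with a dict of metrics;
        # key names would need to match whatever the trainer actually reports.
        self.step += 1
        reward_mean = metrics.get("reward_mean", float("nan"))
        kl = metrics.get("approx_kl", float("nan"))
        with open(self.log_path, "a") as f:
            f.write(f"{self.step},{reward_mean},{kl}\n")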