
How to evaluate the effect of PPO training in coati chat

Open guijuzhejiang opened this issue 1 year ago • 1 comments

I ran the third step (PPO training); it was time-consuming and unstable. The reward observed during training ranges from -300 to -10, as shown in the log below. Is this normal? What does a good PPO training run look like? Is there a reference log I can check against?

Episode [1/11]: 100%|██████████| 200/200 [1:41:33<00:00, 30.47s/it]
Episode [2/11]: 100%|██████████| 200/200 [1:45:15<00:00, 31.58s/it]
Episode [3/11]: 100%|██████████| 200/200 [1:45:56<00:00, 31.78s/it]
Episode [4/11]: 100%|██████████| 200/200 [1:45:38<00:00, 31.69s/it]
Train epoch [1/2]: 100%|██████████| 1000/1000 [1:23:58<00:00, 5.04s/it, reward=-7.67]
Train epoch [2/2]: 100%|██████████| 1000/1000 [1:23:51<00:00, 5.03s/it, reward=-7.74]
Episode [5/11]: 100%|██████████| 200/200 [4:33:10<00:00, 81.95s/it] it, reward=-7.74]
Episode [6/11]: 100%|██████████| 200/200 [1:44:04<00:00, 31.22s/it]
Episode [7/11]: 100%|██████████| 200/200 [1:44:18<00:00, 31.29s/it]
Episode [8/11]: 100%|██████████| 200/200 [1:44:10<00:00, 31.25s/it]
Episode [9/11]: 100%|██████████| 200/200 [1:44:24<00:00, 31.32s/it]
Episode [10/11]: 100%|█████████▉| 199/200 [1:41:53<00:30, 30.14s/it]
Train epoch [1/2]: 26%|██▌ | 261/1000 [21:50<1:02:27, 5.07s/it, reward=-188]

guijuzhejiang avatar Apr 17 '23 01:04 guijuzhejiang

Hi @guijuzhejiang, stage 3 of RLHF uses reinforcement learning (here, the PPO algorithm), so the long run time and instability may be caused by the dataset size and the dynamic nature of the training process. Debugging RL algorithms is not easy, and we are still validating the PPO training ourselves. You could test on simple environments first, or visualize some statistics (such as the running mean, std, min, and max of episode returns, the KL divergence of each policy update, etc.) to check whether training is behaving correctly.
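As a rough illustration of the statistics mentioned above, here is a minimal, framework-independent Python sketch. It is not part of coati or ColossalAI: `RunningStats` and `approx_kl` are hypothetical helper names, and the sketch assumes you can get at per-step scalar rewards and the per-token log-probabilities from the policy before and after an update. The KL value is the simple sample-mean estimator of KL(old || new), which is only meaningful when the samples were drawn from the old policy.

```python
import math


class RunningStats:
    """Running mean/std (Welford's algorithm) plus min/max of a scalar stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, value):
        value = float(value)
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def std(self):
        # Population standard deviation; 0.0 until there are two samples.
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / self.count)

    def summary(self):
        return {
            "count": self.count,
            "mean": self.mean,
            "std": self.std,
            "min": self.min,
            "max": self.max,
        }


def approx_kl(old_log_probs, new_log_probs):
    """Estimate KL(old || new) as the mean of (old - new) log-probabilities.

    Both arguments are equal-length sequences of floats, evaluated on the
    same sampled tokens (sampled from the old policy).
    """
    if len(old_log_probs) != len(new_log_probs) or not old_log_probs:
        raise ValueError("log-prob sequences must be non-empty and equal length")
    diffs = [o - n for o, n in zip(old_log_probs, new_log_probs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    stats = RunningStats()
    # Reward values taken from the progress bars in the log above.
    for reward in (-7.67, -7.74, -188.0):
        stats.update(reward)
    print(stats.summary())
```

A reward mean that drops sharply while the estimated KL grows from one update to the next is a common sign that the policy is drifting too far from the reference model; logging both per training step makes that easier to spot than the single `reward=` value on the progress bar.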

Camille7777 avatar Apr 18 '23 07:04 Camille7777

@Camille7777 Hi. Are there any options or callbacks I can use to check reward_mean, KL, or other metrics during training?

allzero-kwon avatar Aug 14 '23 03:08 allzero-kwon