
[BUG]: PPO errors

Open · guijuzhejiang opened this issue 2 years ago · 4 comments

πŸ› Describe the bug

When I train stage 3 (PPO) in Chat, the following error occurs:

/home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:303 in
   300   parser.add_argument('--max_datasets_size', type=int, default=None)
   301   parser.add_argument('--max_len', type=int, default=512)
   302   args = parser.parse_args()
❱  303   main(args)
   304

/home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:259 in main
   256       eos_token_id=tokenizer_actor.eos_token_id,
   257   )
   258
❱  259   trainer.fit(prompt_dataloader=prompt_dataloader,
   260               pretrain_dataloader=pretrain_dataloader,
   261               num_episodes=args.num_episodes,
   262               max_timesteps=args.max_timesteps,

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/base.py:125 in fit
   122                 if time % update_timesteps == 0:
   123                     self.experience_maker.initial_model.to('cpu')
   124                     self.experience_maker.reward_model.to('cpu')
❱  125                     self._learn()
   126                     self.replay_buffer.clear()
   127             self._on_episode_end(episode)
   128         self._on_fit_end()

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/base.py:93 in _learn
    90                 for experience in pbar:
    91                     self._on_learn_batch_start()
    92                     experience.to_device(device)
❱   93                     metrics = self.training_step(experience)
    94                     self._on_learn_batch_end(metrics, experience)
    95                     pbar.set_postfix(metrics)
    96                 self._on_learn_epoch_end(epoch)

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/ppo.py:103 in training_step
   100             label = batch['labels'].to(torch.cuda.current_device())[:,
   101             attention_mask = batch['attention_mask'].to(torch.cuda.cur
   102             ptx_log_probs = self.actor.get_base_model()(ptx, attention
❱  103             ptx_loss = self.ptx_loss_fn(ptx_log_probs.view(-1, ptx_log
   104             actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 -
   105
   106         self.strategy.backward(actor_loss, self.actor, self.actor_opti

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Episode [10/10]:  50%|█████     | 5/10 [02:32<02:32, 30.43s/it]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1789656 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1789657) of binary: /home/zzg/miniconda3/envs/py39_DL_cu118/bin/python

Environment

CUDA: 11.8, PyTorch: 1.13.1, transformers: 4.29.0.dev0, system: Ubuntu 22

guijuzhejiang · Apr 13 '23

A quick and dirty fix is to modify this line to be

ptx_loss = self.ptx_loss_fn(ptx_log_probs.contiguous().view(-1, ptx_log_probs.size(-1)), label.contiguous().view(-1))
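
For background, .view() only works when the requested shape is compatible with the tensor's existing memory layout, and the error message indicates ptx_log_probs is not laid out contiguously along the flattened dimensions. A minimal sketch that reproduces the same failure and shows both workarounds (the shapes are illustrative only, not the actual model outputs):

import torch

# A (batch, seq, vocab)-like tensor made non-contiguous, e.g. via a transpose.
log_probs = torch.randn(2, 4, 8).transpose(0, 1)

try:
    log_probs.view(-1, log_probs.size(-1))                      # raises the RuntimeError above
except RuntimeError as err:
    print(err)                                                  # "view size is not compatible ..."

flat = log_probs.contiguous().view(-1, log_probs.size(-1))      # copy to contiguous memory, then view
flat = log_probs.reshape(-1, log_probs.size(-1))                # equivalent: reshape copies only if needed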

JThh · Apr 13 '23

@JThh Thank you, modifying it here does fix the error. Using reshape also works, but reshape seems less efficient.
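
That said, torch.reshape falls back to a copy only when the data is not already contiguous, so in this case it should cost roughly the same as .contiguous().view(). A quick check with illustrative shapes:

import torch

x = torch.randn(2, 4, 8).transpose(0, 1)           # non-contiguous, as in the failing case
a = x.contiguous().view(-1, x.size(-1))             # explicit copy, then view
b = x.reshape(-1, x.size(-1))                       # also copies here, since x is non-contiguous
print(torch.equal(a, b))                            # True: both give the same flattened tensor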

guijuzhejiang · Apr 13 '23

@JThh In addition, how do you recommend setting these parameters: num_episodes, max_epochs, max_timesteps, and update_timesteps?

guijuzhejiang · Apr 13 '23

Hi, I recommend going with the defaults or adjusting them based on your needs. Due to our limited training capacity, we cannot provide best-practice values right now!
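
For orientation, the coati/trainer/base.py frames in the traceback above hint at how these parameters interact: fit runs num_episodes episodes of max_timesteps experience-collection steps each, triggers _learn() every update_timesteps steps, and _learn() then makes max_epochs passes over the replay buffer. A rough, simplified model of that schedule (inferred from the traceback, not the actual coati implementation; the example values below are assumptions):

def ppo_schedule(num_episodes, max_timesteps, update_timesteps, max_epochs):
    """Count collection steps, PPO updates, and buffer passes for one setting.
    Simplified model of the trainer loop seen in the traceback above."""
    collection_steps = updates = buffer_passes = 0
    for _ in range(num_episodes):                   # each training episode
        for time in range(1, max_timesteps + 1):
            collection_steps += 1                   # one experience-generation step
            if time % update_timesteps == 0:        # point where _learn() is triggered
                updates += 1
                buffer_passes += max_epochs         # epochs over the collected replay buffer
    return collection_steps, updates, buffer_passes

# Hypothetical example: 10 episodes of 10 timesteps, updating every 10 steps,
# 1 PPO epoch per update -> 100 collection steps, 10 updates, 10 buffer passes.
print(ppo_schedule(10, 10, 10, 1))                  # (100, 10, 10)

In short, num_episodes * max_timesteps bounds the total amount of experience collected, update_timesteps sets how often PPO updates run on that experience, and max_epochs sets how many passes each update makes over the buffer.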

JThh · Apr 17 '23