🐛 Describe the bug
When I train stage 3 (PPO) in Chat, the following error occurs:
/home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:303 in <module>

   300   parser.add_argument('--max_datasets_size', type=int, default=None)
   301   parser.add_argument('--max_len', type=int, default=512)
   302   args = parser.parse_args()
 ❱ 303   main(args)
   304

/home/zzg/workspace/pycharm/ColossalAI/applications/Chat/examples/train_prompts_jp.py:259 in main

   256       eos_token_id=tokenizer_actor.eos_token_id,
   257   )
   258
 ❱ 259   trainer.fit(prompt_dataloader=prompt_dataloader,
   260               pretrain_dataloader=pretrain_dataloader,
   261               num_episodes=args.num_episodes,
   262               max_timesteps=args.max_timesteps,

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/base.py:125 in fit

   122           if time % update_timesteps == 0:
   123               self.experience_maker.initial_model.to('cpu')
   124               self.experience_maker.reward_model.to('cpu')
 ❱ 125               self._learn()
   126               self.replay_buffer.clear()
   127       self._on_episode_end(episode)
   128   self._on_fit_end()

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/base.py:93 in _learn

    90       for experience in pbar:
    91           self._on_learn_batch_start()
    92           experience.to_device(device)
 ❱  93           metrics = self.training_step(experience)
    94           self._on_learn_batch_end(metrics, experience)
    95           pbar.set_postfix(metrics)
    96       self._on_learn_epoch_end(epoch)

/home/zzg/miniconda3/envs/py39_DL_cu118/lib/python3.9/site-packages/coati/trainer/ppo.py:103 in training_step

   100       label = batch['labels'].to(torch.cuda.current_device())[:, …
   101       attention_mask = batch['attention_mask'].to(torch.cuda.cur…
   102       ptx_log_probs = self.actor.get_base_model()(ptx, attention…
 ❱ 103       ptx_loss = self.ptx_loss_fn(ptx_log_probs.view(-1, ptx_log…
   104       actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 - …
   105
   106   self.strategy.backward(actor_loss, self.actor, self.actor_opti…
RuntimeError: view size is not compatible with input tensor's size and stride
(at least one dimension spans across two contiguous subspaces). Use
.reshape(...) instead.
Episode [10/10]:  50%|█████     | 5/10 [02:32<02:32, 30.43s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1789656 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1789657) of binary: /home/zzg/miniconda3/envs/py39_DL_cu118/bin/python
Environment
CUDA: 11.8
PyTorch: 1.13.1
transformers: 4.29.0.dev0
system: Ubuntu 22
A quick and dirty fix is to modify this line to:
ptx_loss = self.ptx_loss_fn(ptx_log_probs.contiguous().view(-1, ptx_log_probs.size(-1)), label.contiguous().view(-1))
JThh · Apr 13 '23 07:04
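For context, the failure is easy to reproduce outside the trainer: .view() requires a contiguous memory layout, while .reshape() (or an explicit .contiguous() first) copies the data only when needed. A minimal sketch with made-up shapes (not the actual trainer tensors):

import torch

logits = torch.randn(2, 5, 8)            # (batch, seq_len, vocab), contiguous
shifted = logits[:, :-1, :]              # dropping the last position breaks contiguity
print(shifted.is_contiguous())           # False: the batch dim now spans two contiguous subspaces

try:
    shifted.view(-1, shifted.size(-1))   # raises the RuntimeError shown in the log above
except RuntimeError as e:
    print(e)

# Either workaround yields the same (batch * (seq_len - 1), vocab) tensor:
a = shifted.contiguous().view(-1, shifted.size(-1))
b = shifted.reshape(-1, shifted.size(-1))
print(a.shape, torch.equal(a, b))        # torch.Size([8, 8]) True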
@JThh Thank you, that is indeed the line to modify. Using reshape also works, but reshape seems less efficient.
@JThh In addition, how do you recommend setting these parameters: num_episodes, max_epochs, max_timesteps, update_timesteps?
Hi, I recommend going with the defaults or adjusting them to your needs. Due to limited training capacity, we cannot provide you with best-practice settings right now!
JThh · Apr 17 '23 09:04
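For reference, the way those four arguments interact can be read off the fit and _learn frames in the traceback above: each episode collects rollouts for max_timesteps steps, a PPO update runs every update_timesteps rollouts, and each update iterates max_epochs times over the replay buffer. A simplified, runnable skeleton (helper names and numbers are placeholders, not the real coati API):

def make_experience(step):
    return f"rollout-{step}"              # stand-in for generation + reward computation

def learn(buffer, max_epochs):
    for epoch in range(max_epochs):       # optimisation passes over the collected buffer
        for experience in buffer:
            pass                          # training_step(experience) would run here

num_episodes, max_timesteps, update_timesteps, max_epochs = 10, 10, 10, 5
replay_buffer = []
for episode in range(num_episodes):                # outer PPO episodes
    for time in range(1, max_timesteps + 1):       # rollouts collected per episode
        replay_buffer.append(make_experience(time))
        if time % update_timesteps == 0:           # update once every update_timesteps rollouts
            learn(replay_buffer, max_epochs)
            replay_buffer.clear()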