> > Trained at 8k, it can extrapolate to 32k
> >
> > For 8k training, is the base already 1,000,000?

Hope this issue could be reopened.
Same question here. Looking at the code, it seems NTK scaling is used, since the base parameter is 5,000,000, while 10,000 is normally used for 4k training. What puzzles me is that the model whose context was not extended also has a base of 5,000,000, yet the claim is extrapolation from 4k to 32k. Is that reasonable? If the length is only extended 8x, why set the base to 5,000,000 (rather than roughly 80,000)?
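For reference, the NTK-aware rule usually cited is `new_base = base * s^(d/(d-2))`, where `s` is the context scale factor and `d` the rotary head dimension. A minimal sketch (assuming head dim 128 and an original base of 10,000, which may not match this model) shows why an 8x extension would suggest a base around 8e4 rather than 5e6:

```python
import math

def ntk_scaled_base(base: float, scale: float, head_dim: int = 128) -> float:
    """NTK-aware RoPE: stretch the base so the lowest frequency spans `scale`x more context."""
    return base * scale ** (head_dim / (head_dim - 2))

# 8x context extension (4k -> 32k) from the usual base of 10,000:
print(ntk_scaled_base(10_000, 8))   # ~82,700, i.e. roughly 8e4

def implied_scale(new_base: float, base: float = 10_000, head_dim: int = 128) -> float:
    """Effective context scale implied by a given base, inverting the formula above."""
    return (new_base / base) ** ((head_dim - 2) / head_dim)

print(implied_scale(5_000_000))     # ~450x, far beyond the stated 8x
```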
I (and most developers) hope the final prompt would look like the following, taking the ChatML template as an example:

```
user
2+2=?
assistant
```

The str in Python is `"user\n2+2=?\nassistant\n"`. If we...
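One way to check the exact string that reaches the model is `tokenizer.apply_chat_template`; a sketch assuming a Hugging Face tokenizer that ships a ChatML-style template (the model name is a placeholder, and the special tokens depend on the checkpoint):

```python
from transformers import AutoTokenizer

# Hypothetical model name; substitute the checkpoint under discussion.
tokenizer = AutoTokenizer.from_pretrained("some/chatml-model")

messages = [{"role": "user", "content": "2+2=?"}]

# add_generation_prompt=True appends the assistant header so the model starts answering.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(prompt))  # inspect the exact string, including newlines and special tokens
```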
Hi, I wonder whether the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which...
Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on the arXiv dataset). Thanks for your contribution! Also, I have an extra question, ...
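For context on why ~2.3 counts as normal: with a broken weight conversion the next-token distribution is close to uniform, so the cross-entropy starts near ln(vocab_size). A quick check, assuming Mixtral's 32,000-token vocabulary:

```python
import math

vocab_size = 32_000  # Mixtral tokenizer vocabulary (assumption)

# A badly converted (effectively random) model predicts roughly uniformly,
# so its initial cross-entropy loss sits near ln(vocab_size).
print(math.log(vocab_size))  # ~10.4 nats, far above the ~2.3 seen with a correct conversion
```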
Some tips (might be helpful):

1. Decrease `actor_rollout_ref.rollout.n`
2. Ensure the setting `export VLLM_ATTENTION_BACKEND=XFORMERS`
3. Decrease `actor_rollout_ref.actor.ppo_micro_batch_size`
4. Decrease `actor_rollout_ref.rollout.log_prob_micro_batch_size` and `actor_rollout_ref.ref.log_prob_micro_batch_size`
5. Decrease `data.max_response_length`
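Put together, these usually appear as Hydra-style overrides on the training command. An illustrative sketch, assuming verl's `main_ppo` entry point; the concrete values are placeholders and should be tuned to your GPU memory:

```bash
# Use the xformers attention backend for vLLM rollouts.
export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.actor.ppo_micro_batch_size=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    data.max_response_length=1024
```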