> > Trained at 8k, it can extrapolate to 32k
> >
> > For 8k training, is the base already 1,000,000?

Hope this issue could be reopened.
Same question here. Looking at the code, it seems NTK scaling is used, since the base parameter is 5,000,000, while 10,000 is normally used for 4k training. What puzzles me is that the model whose context was not extended also has a base of 5,000,000, yet the claim is extrapolation from 4k to 32k. Is that reasonable? If the length is only extended 8x, why set the base to 5,000,000 (rather than roughly 80,000)?
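For reference, the NTK-aware rule usually cited is `new_base = base * s^(d/(d-2))`, where `s` is the context scale factor and `d` the rotary head dimension. A minimal sketch (assuming head dim 128 and an original base of 10,000, which may not match this model) shows why an 8x extension would suggest a base around 8e4 rather than 5e6:

```python
import math

def ntk_scaled_base(base: float, scale: float, head_dim: int = 128) -> float:
    """NTK-aware RoPE: stretch the base so the lowest frequency spans `scale`x more context."""
    return base * scale ** (head_dim / (head_dim - 2))

# 8x context extension (4k -> 32k) from the usual base of 10,000:
print(ntk_scaled_base(10_000, 8))   # ~82,700, i.e. roughly 8e4

def implied_scale(new_base: float, base: float = 10_000, head_dim: int = 128) -> float:
    """Effective context scale implied by a given base, inverting the formula above."""
    return (new_base / base) ** ((head_dim - 2) / head_dim)

print(implied_scale(5_000_000))     # ~450x, far beyond the stated 8x
```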
I (and most developers) hope the final prompt would look like the following, taking the ChatML template as an example:

```
user
2+2=?
assistant
```

The str in Python is `"user\n2+2=?\nassistant\n"`. If we...
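One way to check the exact string that reaches the model is `tokenizer.apply_chat_template`; a sketch assuming a Hugging Face tokenizer that ships a ChatML-style template (the model name is a placeholder, and the special tokens depend on the checkpoint):

```python
from transformers import AutoTokenizer

# Hypothetical model name; substitute the checkpoint under discussion.
tokenizer = AutoTokenizer.from_pretrained("some/chatml-model")

messages = [{"role": "user", "content": "2+2=?"}]

# add_generation_prompt=True appends the assistant header so the model starts answering.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(repr(prompt))  # inspect the exact string, including newlines and special tokens
```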
Hi, I wonder whether the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which...
Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on the arXiv dataset). Thanks for your contribution! Also, I have an extra question, ...
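For context on why ~2.3 counts as normal: with a broken weight conversion the next-token distribution is close to uniform, so the cross-entropy starts near ln(vocab_size). A quick check, assuming Mixtral's 32,000-token vocabulary:

```python
import math

vocab_size = 32_000  # Mixtral tokenizer vocabulary (assumption)

# A badly converted (effectively random) model predicts roughly uniformly,
# so its initial cross-entropy loss sits near ln(vocab_size).
print(math.log(vocab_size))  # ~10.4 nats, far above the ~2.3 seen with a correct conversion
```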
Some tips (might be helpful):

1. Decrease `actor_rollout_ref.rollout.n`
2. Ensure the setting `export VLLM_ATTENTION_BACKEND=XFORMERS`
3. Decrease `actor_rollout_ref.actor.ppo_micro_batch_size`
4. Decrease `actor_rollout_ref.rollout.log_prob_micro_batch_size` and `actor_rollout_ref.ref.log_prob_micro_batch_size`
5. Decrease `data.max_response_length`
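Put together, these usually appear as Hydra-style overrides on the training command. An illustrative sketch, assuming verl's `main_ppo` entry point; the concrete values are placeholders and should be tuned to your GPU memory:

```bash
# Use the xformers attention backend for vLLM rollouts.
export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.actor.ppo_micro_batch_size=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    data.max_response_length=1024
```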