Yushi Bai
Your test results look quite close to ours; I believe the difference is basically within random error. How do you truncate the input?
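For reference, the LongBench `pred.py` script truncates from the middle of the sequence, keeping the head and tail of the tokenized input. A minimal sketch of that idea, assuming a Hugging Face tokenizer; `truncate_middle` is an illustrative name, not the repo's function:

```python
# Sketch of middle truncation: keep the first and last `half` tokens,
# drop the middle. Assumes a Hugging Face tokenizer.
def truncate_middle(prompt: str, tokenizer, max_length: int) -> str:
    input_ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(input_ids) <= max_length:
        return prompt
    half = max_length // 2
    kept = input_ids[:half] + input_ids[-half:]
    return tokenizer.decode(kept, skip_special_tokens=True)
```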
Hi! Please refer to Section 4.2 of our paper for the details of DPO. We use the same codebase as [ChatGLM-RLHF](https://arxiv.org/abs/2404.00934). We currently have no plans to release the code...
Hi, our DPO code is based on Megatron-LM.
The `shift_weights` here have already been normalized: the weights within each sample sum to 1.
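For illustration, a minimal sketch of that per-sample normalization, assuming `shift_weights` is a `[batch, seq_len]` tensor of non-negative token weights (the shape and function name are assumptions, not the actual training code):

```python
import torch

# Illustrative per-sample normalization: scale each row so that its
# weights sum to 1. The clamp guards against all-zero rows.
def normalize_weights(shift_weights: torch.Tensor) -> torch.Tensor:
    row_sums = shift_weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return shift_weights / row_sums
```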
The GPT-3.5-Turbo-16k model evaluated in our paper has been deprecated. You can try gpt-3.5-turbo-0125 (16k) or the more recent gpt-4o-mini (128k), according to OpenAI's model list (https://platform.openai.com/docs/models).
Right. We didn't provide code for evaluating API models. You can modify the [get_pred()](https://github.com/THUDM/LongBench/blob/main/pred.py#L51) function to do so.
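A minimal sketch of what the replacement call inside `get_pred()` could look like, assuming the OpenAI v1 Python client; `api_generate` is a hypothetical helper, not part of the repo:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical helper: swaps the local-model generation inside
# get_pred() for a chat-completion API call.
def api_generate(prompt: str, model: str = "gpt-4o-mini", max_tokens: int = 512) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,        # deterministic-ish decoding for reproducible scores
        max_tokens=max_tokens,  # cap on generated tokens, not on context length
    )
    return response.choices[0].message.content
```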
1. We recommend doing long-context alignment fine-tuning (SFT, DPO) on a base model that has already undergone length extension.
2. GPU memory usage depends on the sequence length; for example, in our paper, training at 64k length with ZeRO-3 requires 80GB of GPU memory (a sketch of such a config follows this list).
3. If your base model has already been continually pre-trained on longer sequences (length extension), fine-tuning alone is enough; otherwise you need to do that continual pre-training first.
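For point 2, a minimal sketch of a DeepSpeed ZeRO-3 configuration for long-sequence training; all values are illustrative placeholders, not our exact setup:

```python
# Illustrative DeepSpeed ZeRO-3 settings for long-sequence fine-tuning.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # long sequences force tiny per-GPU batches
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                        # shard params, grads, and optimizer states
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```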
Hi, I suspect a misalignment in the chat prompt template, but I'm not sure how `client.chat` deals with the chat template. Can you provide more details?
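One way to check on your side, assuming a Hugging Face tokenizer with a chat template (the model name below is just a placeholder): render the template locally and compare it against what `client.chat` actually sends. If the server also applies a template, an already-templated string gets double-wrapped.

```python
from transformers import AutoTokenizer

# Placeholder model name; substitute the checkpoint you are serving.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
messages = [{"role": "user", "content": "Hello"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # compare against the request body that client.chat produces
```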
Hi, Long (>128k) is just a subset of the evaluation data: the set of all test samples longer than 128k tokens. For evaluation on all data we use `--max_model_len 131072` and truncate sequences that exceed 128k tokens.
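A minimal sketch of that setup with vLLM's offline `LLM` API; the model path, prompt, and sampling values are placeholders:

```python
from vllm import LLM, SamplingParams

# Cap the context window at 128k tokens; inputs longer than this must be
# truncated before generation.
llm = LLM(model="path/to/model", max_model_len=131072)
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["<your long prompt here>"], params)
print(outputs[0].outputs[0].text)
```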
Hi, please update to the latest [trans_web_demo.py](https://github.com/THUDM/LongWriter/blob/main/trans_web_demo.py).