Wu Chengyue comments

Question regarding the difference between llama-pro and the regular llama.

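Thanks for your interest! Regarding "llama-pro adds 8 identity blocks on top of llama and also does full-parameter training on general corpora": there may be a small misunderstanding here. After adding the 8 identity blocks we did not do full-parameter training; we only trained the newly added 8 identity blocks during the subsequent code and math pretraining. We will consider the experiment you proposed. I think our method differs from the setting you describe in that we keep the original LLaMA parameters entirely unchanged and train only the newly added blocks. The hope is to preserve general ability this way and to obtain plug-and-play blocks (for different domains, you can take the same base model and train new layers).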
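Why adding an identity block leaves the base model's behaviour unchanged at the start of training can be shown with a toy residual block. This is only a sketch: it assumes an identity block is a copy of an existing block whose output projection is zeroed, and `residual_block`, `make_identity_block`, and the small matrices are hypothetical stand-ins, not the project's code.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def residual_block(x, w_in, w_out):
    """Toy stand-in for a transformer block: y = x + W_out @ relu(W_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    update = matvec(w_out, hidden)
    return [xi + ui for xi, ui in zip(x, update)]


def make_identity_block(w_in, w_out):
    """Copy a block and zero its output projection (assumed initialisation)."""
    zeroed_out = [[0.0 for _ in row] for row in w_out]
    return [row[:] for row in w_in], zeroed_out


# Hypothetical weights for one existing block (3 hidden units, 2 features).
w_in = [[0.5, -1.0], [2.0, 0.25], [1.0, 1.0]]
w_out = [[1.0, 0.0, -0.5], [0.3, 0.7, 0.1]]
x = [1.5, -2.0]

new_w_in, new_w_out = make_identity_block(w_in, w_out)
y_original = residual_block(x, w_in, w_out)          # changes the input
y_identity = residual_block(x, new_w_in, new_w_out)  # passes the input through
```

With the output projection at zero, only the residual path contributes, so the new block starts as a no-op and begins to matter only once its weights are trained.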

Question regarding the difference between llama-pro and the regular llama.

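> After running block_expansion.py, the llama2_7b_hf model comes out as a model with the extra expanded blocks (call it llama_pro_8B), and then the newly expanded blocks can be trained with full parameters.

We do not do full-parameter training on the expanded model; we train only the newly added blocks. The new blocks are plug-and-play modules. For example, you can train the newly added blocks (about 1B parameters) on math and code, or train newly added blocks of the same size on other domains; the base is always the original llama-7B.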
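The shape of the expanded model can be sketched as a list of layer slots. The sketch assumes 32 original layers split into 8 equal groups with one new block inserted after each group, each new block starting as a copy of the layer before it. That even interleaving is an assumption for illustration, and `expanded_layout` is a hypothetical helper, not a description of what block_expansion.py does.

```python
def expanded_layout(num_original=32, num_new=8):
    """Return (source_layer, is_new) for every slot in the expanded model."""
    if num_original % num_new != 0:
        raise ValueError("num_original must be divisible by num_new")
    group_size = num_original // num_new
    layout = []
    for src in range(num_original):
        layout.append((src, False))        # original layer, kept frozen
        if (src + 1) % group_size == 0:
            layout.append((src, True))     # new block, copied from the layer before it
    return layout


layout = expanded_layout()
new_positions = [pos for pos, (_, is_new) in enumerate(layout) if is_new]
```

Under these assumptions the expanded model has 40 layers, and `new_positions` lists the only slots that would receive gradient updates during domain pretraining.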

Question regarding the difference between llama-pro and the regular llama.

> May I ask: for the pretraining after adding the 8 identity blocks, did you use only the incremental code and math data?

Yes. We used the Python subset of [the-stack-dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup), together with [proof-pile-2](https://huggingface.co/datasets/EleutherAI/proof-pile-2).

Question regarding the difference between llama-pro and the regular llama.

> Does "plug-and-play blocks" mean llama-7B.load_state_dict(blocks checkpoint)?

My understanding is that it should be llama-8B.load_state_dict(base_model_ckpt + blocks checkpoint).
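What "base_model_ckpt + blocks checkpoint" could mean in practice is sketched below with plain dictionaries standing in for state dicts. The sketch makes several assumptions: layer keys follow the `model.layers.<index>.` naming, the blocks checkpoint already uses indices of the expanded model, and base layers must be renumbered around the new slots. `merge_checkpoints` is a hypothetical helper, and the string values stand in for tensors.

```python
import re

LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.(.+)$")


def merge_checkpoints(base_ckpt, blocks_ckpt, new_positions):
    """Renumber base layers around the new slots, then add the new blocks."""
    new_set = set(new_positions)
    base_layers = {int(m.group(1)) for m in map(LAYER_KEY.match, base_ckpt) if m}
    total = len(base_layers) + len(new_set)
    kept_positions = [pos for pos in range(total) if pos not in new_set]
    old_to_new = dict(zip(sorted(base_layers), kept_positions))

    merged = {}
    for key, value in base_ckpt.items():
        m = LAYER_KEY.match(key)
        if m:  # shift the layer index; non-layer keys are copied unchanged
            key = f"model.layers.{old_to_new[int(m.group(1))]}.{m.group(2)}"
        merged[key] = value
    for key, value in blocks_ckpt.items():
        m = LAYER_KEY.match(key)
        if m is None or int(m.group(1)) not in new_set:
            raise ValueError(f"unexpected key in blocks checkpoint: {key}")
        merged[key] = value
    return merged


# Toy example: a 4-layer base and two new blocks at expanded positions 2 and 5.
base = {"model.embed_tokens.weight": "embed", "model.norm.weight": "norm"}
base.update({f"model.layers.{i}.w": f"base-{i}" for i in range(4)})
blocks = {"model.layers.2.w": "new-a", "model.layers.5.w": "new-b"}
merged = merge_checkpoints(base, blocks, [2, 5])
```

Because the base weights are never modified, the same base checkpoint could be combined with a different blocks checkpoint for each domain.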

What is the advantage over LoRA?

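1. In our experience, LoRA does not work well for pretraining and is better suited to SFT. Under pretraining, our method converges better than LoRA (it has more trainable parameters, so it learns more). I come back to pretraining versus SFT in point 3.
2. We did not deliberately mix in general-domain data, but we noticed that existing domain datasets already include some general text. For example, proof-pile-2, which we used, contains some general text, and with only that we still managed to avoid forgetting.
3. In the SFT stage we did train all parameters. We also tried training only the newly added layers, and the results were close; we use full-parameter training so that our method stays compatible with the standard training pipeline. I think SFT probably has less effect on forgetting; see the paper The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. Our view is that SFT may mostly just elicit knowledge learned during pretraining, so its effect is small. This is also why LoRA works well for SFT but is not suited to pretraining.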

Question about Llama-7B and Llama-7B-Pro comparison.

We have not run this experiment yet, but we may do so later. We are currently working on extending the expansion to Mistral and multi-modal models.

How to load the new model weight

I think you may need to revise the config, especially the "num_hidden_layers" key in the config.json file. Set this key to the number of layers after expansion...
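A minimal sketch of that config change follows. It assumes the key in question is `num_hidden_layers`, the layer-count field in Hugging Face LLaMA configs, and that 8 blocks were added to a 32-layer model. `expand_config` is a hypothetical helper, and the JSON string stands in for the contents of config.json.

```python
import json


def expand_config(config_text, num_new_blocks):
    """Return config JSON with the layer count raised by num_new_blocks."""
    config = json.loads(config_text)
    config["num_hidden_layers"] = config["num_hidden_layers"] + num_new_blocks
    return json.dumps(config, indent=2)


# Stand-in for the original config.json (only the relevant fields shown).
original = '{"model_type": "llama", "num_hidden_layers": 32}'
updated = json.loads(expand_config(original, 8))
```

If the layer count in the config does not match the checkpoint, the loader has no slots for the extra layers' weights, which is the likely reason the expanded weights fail to load.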

How do we fine-tune the expanded blocks?

Thanks for your interest! I have uploaded the training code to this repo; you can also look at https://github.com/hills-code/open-instruct/tree/llama-pro

How do we fine-tune the expanded blocks?

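This project is for SFT training. At this stage all parameters are trained together, the same as ordinary SFT. During pretraining, parameters are frozen; the specific code is here: https://github.com/hills-code/open-instruct/blob/7c2b14d3d319028c68657946ca2c16b248f866e8/open_instruct/customized_trainer.py#L53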

Should I freeze norm.weight?

We freeze all the weights of the initial llama model and only train the newly added blocks.
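The freezing rule can be written as a filter on parameter names. This is a sketch, not the code in customized_trainer.py: it assumes parameters are named in the `model.layers.<index>.` style, and the set of new layer indices is a hypothetical example.

```python
import re

LAYER_INDEX = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def is_trainable(param_name, new_layer_ids):
    """True only for parameters that live inside a newly added block."""
    match = LAYER_INDEX.search(param_name)
    return match is not None and int(match.group(1)) in new_layer_ids


# Assumed positions of the 8 new blocks in a 40-layer expanded model.
new_layer_ids = {4, 9, 14, 19, 24, 29, 34, 39}

names = [
    "model.embed_tokens.weight",
    "model.layers.3.self_attn.q_proj.weight",
    "model.layers.4.self_attn.q_proj.weight",
    "model.layers.4.input_layernorm.weight",
    "model.norm.weight",
    "lm_head.weight",
]
trainable = [name for name in names if is_trainable(name, new_layer_ids)]
```

Under this rule the final `model.norm.weight` belongs to the initial model and stays frozen, while the layer norms inside a new block are trained with the rest of that block. In PyTorch the same predicate could set `requires_grad` on each entry of `model.named_parameters()`.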