Wu Chengyue

Results: 29 comments by Wu Chengyue

1. You do not need pytorch_model.bin.index.json. For the other necessary files, you can just copy them from the original base model. 2. The code can directly load the dataset from the Hugging Face...
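A minimal sketch of both points, assuming standard Hugging Face tooling; the directory names, file list, and dataset id are placeholders for illustration, not the repository's actual paths:

```python
# Sketch: reuse the base model's auxiliary files and load training data from the Hub.
import shutil
from pathlib import Path
from datasets import load_dataset

base_dir = Path("llama-2-7b-hf")           # original base model (placeholder path)
expanded_dir = Path("llama-pro-expanded")  # directory holding the expanded weights (placeholder path)
expanded_dir.mkdir(parents=True, exist_ok=True)

# Tokenizer and related files can simply be copied over from the base model;
# no pytorch_model.bin.index.json is needed.
for name in ["tokenizer.model", "tokenizer_config.json", "special_tokens_map.json"]:
    src = base_dir / name
    if src.exists():
        shutil.copy(src, expanded_dir / name)

# The training code can load the corpus directly from the Hugging Face Hub.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # placeholder corpus
print(dataset)
```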

Hi! Have you tried directly fine-tuning llama-3-8B-instruct? What happens in that setting? I did not carry out experiments with llama-3, so I may not be very familiar...

Certainly! Here is the link to Yi-9B (https://huggingface.co/01-ai/Yi-9B) and its tech report (https://arxiv.org/pdf/2403.04652). You can find the depth upscaling in Sec. 7.3. ![image](https://github.com/TencentARC/LLaMA-Pro/assets/60053707/9021206c-7192-43e4-bd42-05d3ea9b0833) See also LLaMa3-120B: https://huggingface.co/alpindale/goliath-120b

Yes, you can directly finetune the 8B model with any dataset. You can access the model on Hugging Face (https://huggingface.co/TencentARC/LLaMA-Pro-8B) and use it just like a normal LLaMA model.
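A minimal usage sketch with the standard transformers API; the prompt and generation settings below are just illustrative:

```python
# Sketch: load LLaMA-Pro-8B like any other LLaMA checkpoint and generate text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TencentARC/LLaMA-Pro-8B")
model = AutoModelForCausalLM.from_pretrained(
    "TencentARC/LLaMA-Pro-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```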

It is expected to be released by this month! Thanks for your attention!

Thanks for your attention! I think the main difference between our work and PEFT methods is that we scale up the number of parameters. We have observed the power of scaling, as with GPT,...

It depends on the number of layers you add and on the training setup. In my experience, 8x A100-40G can support pre-training with ctx-length=4096. I have tried increasing the LoRA rank to 1024 so that LoRA's trainable parameter count is close to ours; in that case the GPU memory usage is also about the same.
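For a rough sense of why rank 1024 puts LoRA in the same ballpark, here is a back-of-the-envelope sketch; the LLaMA-7B dimensions, attention-only LoRA targets, and the count of eight added blocks are my own assumptions for illustration, not numbers from this thread:

```python
# Rough comparison: LoRA (r=1024) trainable params vs. 8 newly added transformer blocks.
hidden = 4096   # LLaMA-7B hidden size
inter = 11008   # LLaMA-7B MLP intermediate size
n_layers = 32   # number of original decoder layers

def lora_params(r, d_in, d_out):
    # LoRA adds A (r x d_in) and B (d_out x r) per targeted weight matrix.
    return r * (d_in + d_out)

r = 1024
attn_per_layer = 4 * lora_params(r, hidden, hidden)   # q/k/v/o projections (assumed targets)
lora_total = n_layers * attn_per_layer
print(f"LoRA r={r} on attention: {lora_total / 1e9:.2f}B trainable params")

# One LLaMA-7B block: 4 attention projections + 3 MLP matrices (gate/up/down).
block = 4 * hidden * hidden + 3 * hidden * inter
added_blocks = 8                                      # assumed number of copied blocks
print(f"{added_blocks} added blocks: {added_blocks * block / 1e9:.2f}B trainable params")
```

Both land at roughly 1-1.6B trainable parameters, which is why the memory footprint ends up comparable.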

Yes, but if there are many newly added layers to train, that also brings a large memory footprint, and during training the original model's parameters still need to be loaded even though they are not fine-tuned.

We are also exploring larger models, but such experiments are very resource-intensive. So far we have explored expansion on different architectures, such as Mistral, with some success, e.g. [Mistral-Pro](https://huggingface.co/TencentARC/Mistral_Pro_8B_v0.1), and we will continue to explore ideas in this direction. We also noticed that Yi recently used depth expansion to train for math and code, [Yi-9B](https://www.qbitai.com/2024/03/126184.html), expanding by 16 layers. I believe the position and number of the copied layers still leave a lot worth studying, and we will investigate these step by step.
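Since the copied positions and counts come up repeatedly in this thread, here is a minimal sketch of depth/block expansion on a Hugging Face LLaMA-style model. The base checkpoint, the interleaving interval, and the zero-initialization of the copied blocks' output projections are illustrative assumptions, not the exact recipe of LLaMA-Pro or Yi-9B:

```python
# Sketch: interleave copies of existing decoder blocks, initialize them to act as
# identity mappings, and train only the new blocks while the original weights stay frozen.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder base checkpoint
layers = model.model.layers     # nn.ModuleList of decoder blocks
expand_every = 4                # copy one block after every 4 original blocks (assumption)

new_layers = []
for i, layer in enumerate(layers):
    new_layers.append(layer)
    if (i + 1) % expand_every == 0:
        new_block = copy.deepcopy(layer)
        # Zero the output projections so the copied block starts as an identity mapping.
        torch.nn.init.zeros_(new_block.self_attn.o_proj.weight)
        torch.nn.init.zeros_(new_block.mlp.down_proj.weight)
        new_layers.append(new_block)

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)

# Re-number the per-layer index used by the KV cache (attribute present in recent transformers).
for i, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

# Freeze everything, then unfreeze only the inserted copies; note that the frozen
# original parameters are still loaded and consume memory during training.
for p in model.parameters():
    p.requires_grad = False
for i, layer in enumerate(model.model.layers):
    if (i + 1) % (expand_every + 1) == 0:   # positions of the inserted copies
        for p in layer.parameters():
            p.requires_grad = True
```

Changing `expand_every`, the number of copied blocks, or where they are inserted is exactly the design space mentioned above.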