macheng6

Results: 16 issues by macheng6

### Feature request Using flash attention to speed up. ### Motivation None. ### Your contribution None.
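
For context, a minimal sketch of how flash attention is typically enabled when loading a model with Hugging Face transformers (4.36+); the checkpoint name and the flash-attn/GPU availability below are assumptions, not details from the issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; substitute the model the feature request targets.
model_name = "fnlp/moss-moon-003-sft"

# attn_implementation="flash_attention_2" requires the flash-attn package
# and an fp16/bf16 dtype on a supported GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```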

I would like to know the following about MOSS training: 1. Which model was chosen as the initialization (backbone) for MOSS? 2. What GPU-memory optimization methods were used during MOSS training?

I want to know how to avoid OOM when fine-tuning the 20B model. Is fp16 the only option?
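
As a rough illustration (not taken from any reply), these are common memory-saving levers when fine-tuning a ~20B model with transformers; the checkpoint name and flag values are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Hypothetical 20B checkpoint; replace with the model actually being fine-tuned.
model = AutoModelForCausalLM.from_pretrained(
    "fnlp/moss-moon-003-sft",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Recompute activations in the backward pass instead of storing them all.
model.gradient_checkpointing_enable()

# Small per-device batches plus gradient accumulation keep activation memory low;
# DeepSpeed ZeRO stage 2/3 or LoRA-style adapters are further common options.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    fp16=True,
)
```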

### System Info The DP mode of 4.29.0 seems to have a bug: during the forward pass, the model's dtype is changed to torch.int64, which causes the torch.finfo...
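
For reference, torch.finfo is only defined for floating-point dtypes, so a dtype reported as torch.int64 makes the call fail; a minimal illustration of that behaviour:

```python
import torch

print(torch.finfo(torch.float16).min)  # fine: finfo exists for float dtypes

try:
    torch.finfo(torch.int64)           # raises: int64 has no floating-point info
except TypeError as exc:
    print(exc)
```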

1. Why is there no linear projection to the vocabulary dimension at predict time? Instead, the output is multiplied directly by word_embeddings to map to the vocabulary dimension. 2. Why is GLM loaded with AutoModelForSeq2SeqLM rather than AutoModelForCausalLM?
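
On the first question, reusing the input embedding matrix as the output projection (weight tying) is a common design; a minimal sketch of what multiplying by word_embeddings amounts to (shapes and names here are illustrative, not GLM's actual code):

```python
import torch

vocab_size, hidden_size = 50_000, 1_024
word_embeddings = torch.nn.Embedding(vocab_size, hidden_size)

# Hidden states from the final transformer layer: (batch, seq_len, hidden)
hidden_states = torch.randn(2, 8, hidden_size)

# Instead of a separate output Linear, project onto the embedding weights:
# (batch, seq, hidden) @ (hidden, vocab) -> (batch, seq, vocab)
logits = hidden_states @ word_embeddings.weight.t()
print(logits.shape)  # torch.Size([2, 8, 50000])
```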

### Describe the issue In the LLMLingua project, I attempted to use the qwen model in place of modelname and oai_tokenizer in the code; when the target token is 150, the...
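
A rough sketch of pointing LLMLingua's PromptCompressor at a different backbone; the qwen checkpoint name is an assumption, and the constructor and compress_prompt arguments may differ across LLMLingua versions:

```python
from llmlingua import PromptCompressor

# Hypothetical checkpoint; substitute the qwen model referred to in the issue.
compressor = PromptCompressor(model_name="Qwen/Qwen-7B-Chat", device_map="cuda")

long_prompt = "..."  # the prompt to be compressed

# target_token=150 mirrors the setting mentioned above.
result = compressor.compress_prompt(long_prompt, target_token=150)
print(result["compressed_prompt"])
```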
