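lmft: Questions about the results after fine-tuning

How many epochs were the publicly released weights trained for? Something seems off: after fine-tuning, the answers sometimes contain several consecutive line breaks, and in the few samples I tried the answers were not very good either. Is this because LoRA lacks capacity, or is it that fine-tuning on additional data cannot make a language model learn new knowledge?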

Original answer:

Answer after fine-tuning:

Apr 10 '23 06:04 CoderAnn

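The CSC dataset has 270,000 examples and was trained for 1 epoch. However, your result looks like the output you get without the LoRA model loaded. The LoRA model for CSC correction: https://huggingface.co/shibing624/chatglm-6b-csc-zh-lora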

Apr 10 '23 07:04 shibing624

I did load the model:

Apr 10 '23 07:04 CoderAnn

Install the development version with pip install git+https://github.com/shibing624/lmft.git; the new features have not been released to pip yet.

Apr 10 '23 07:04 shibing624

OK, I'll try it! Thanks.

Apr 10 '23 07:04 CoderAnn

After installing the development version with pip install git+https://github.com/shibing624/lmft.git, inference raises an error. The calling script:

Apr 10 '23 08:04 CoderAnn

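I hit the same error when testing on my server after installing the development version with pip install git+https://github.com/shibing624/lmft.git. Is there an incompatibility? @shibing624 chatglm-6b appears to have updated their model slightly on April 6:

Removed the image tokens from the embedding to reduce GPU memory usage (requires updating the model files pytorch_model-00001-of-00008.bin and pytorch_model-00008-of-00008.bin; thanks to [@silverriver](https://github.com/silverriver) for the idea). Removed the dependency on icetk (requires updating the model file ice_text.model).

My local machine has the old model downloaded on April 4, and inference works there. Because GPU memory is insufficient, I set args to quantize to 8 bit and enabled fp16 when loading. The inference result is a bit odd; the output is: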

['少先队员应该为老人让座。\n正确的写法应该是：少先队员应该为老人让座。其中，“因该”是错误的拼写，正确的写法应该是“应该”。同时，“老人”的拼写也不正确，正确的写法应该是“老年人”。“少先队员”的拼写正确。']

Full log

2023-04-10 13:46:45.331 | DEBUG    | lmft.chatglm_model:__init__:94 - Device: cuda
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%| | 8/8 [00:11<00:00,  1.40s/it]
2023-04-10 13:46:56.924 | DEBUG    | lmft.chatglm_model:__init__:106 - Quantized to 8 bit

Generating outputs:   0%|
/usr/local/lib/python3.9/dist-packages/transformers/tokenization_utils_base.py:717: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:245.)
  tensor = as_tensor(value)
Generating outputs: 100%| | 1/1 [01:05<00:00, 65.14s/it]
['少先队员应该为老人让座。\n正确的写法应该是：少先队员应该为老人让座。其中，“因该”是错误的拼写，正确的写法应该是“应该”。同时，“老人”的拼写也不正确，正确的写法应该是“老年人”。“少先队员”的拼写正确。']

Inference code

#!/usr/bin/env python3

from lmft import ChatGlmModel

glbpath = "/data/ChatGLM-6B/THUDM/chatglm-6b"  # base chatglm-6b weights
csc_lorapath = "/data/chatglm-6b-csc-zh-lora"  # CSC LoRA weights
# Load the base model with the LoRA, quantized to 8 bit with fp16 enabled
model = ChatGlmModel("chatglm", glbpath, lora_name=csc_lorapath, args={"quantization_bit": 8, "fp16": True})
r = model.predict(["对下面中文拼写纠错：\n少先队员因该为老人让坐。\n答："])
print(r)  # expected: ['少先队员应该为老人让座。\n错误字：因，坐']

Apr 10 '23 10:04 bash99
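
The comment above notes that the April 6 update replaced specific model files. One way to tell whether a local copy still holds the old files is to compare SHA-256 digests against values copied by hand from the model repository's file pages. The following is only a sketch: the helper names and the `expected` mapping are hypothetical and not part of lmft or chatglm-6b.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_stale_files(model_dir, expected):
    """Return the names of files that are missing or whose digest differs.

    `expected` maps file name -> SHA-256 hex string (hypothetical input,
    filled in by hand from the upstream repository).
    """
    stale = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of_file(path) != want.lower():
            stale.append(name)
    return stale
```

Any file name returned by `find_stale_files` would need to be downloaded again before retesting the LoRA.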

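Use the latest chatglm-6b weights. In my local test I set "quantization_bit": None, so quantization was off, and training did not use int8 either; my environment is a V100, which does not support int8. Judging by your result, the LoRA is still not taking effect. Once the LoRA is active, the output strictly follows the format '少先队员应该为老人让座。\n错误字：...'.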

Apr 10 '23 12:04 shibing624
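
The format described above suggests a quick sanity check for whether the LoRA is shaping the output: a corrected sentence on the first line and a line starting with 错误字： on the second. This is a minimal sketch inferred from the single example in this thread; the function name is hypothetical, and the real fine-tuned format may allow variations this check does not cover.

```python
def looks_like_csc_output(text, marker="错误字："):
    """Return True if `text` is exactly two lines: a non-empty sentence,
    then a line that starts with `marker`."""
    lines = text.split("\n")
    return (
        len(lines) == 2
        and bool(lines[0].strip())
        and lines[1].startswith(marker)
    )


# The expected output quoted in the thread passes the check.
print(looks_like_csc_output("少先队员应该为老人让座。\n错误字：因，坐"))
# A free-form explanation after the first line does not.
print(looks_like_csc_output("少先队员应该为老人让座。\n正确的写法应该是：少先队员应该为老人让座。"))
```

Applied to each element of the list returned by `model.predict`, a failing check would be a hint, not proof, that the LoRA weights were not applied.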

I get the same result as bash99, and I also had fp16 enabled.

Apr 11 '23 02:04 CoderAnn

Regarding the point that the LoRA is not taking effect: this output also differs from what chatglm-6b itself produces. There may be some run-to-run variation, but across multiple calls the original model never produced this format. So it seems the LoRA did take effect.

Apr 11 '23 02:04 CoderAnn

Right. I deliberately removed the LoRA but still loaded the model through lmft to try it. The output is:

['少先队员应该为老人让座。\n\n正确的拼音是："sòu bèi shǒu gèng jiā"，其中，“少先队员”的拼音是“sòu bèi shǒu gèng”,“老人”的拼音是“gèng jiā”。']

When loading the LoRA afterwards, I still get the same error as above: "RuntimeError: Expected 4-dimensional input for 4-dimensional weight [8192, 8, 1, 1], but got 3-dimensional input of size [1, 16, 4096] instead". The code is as follows:

#!/usr/bin/env python3

from lmft import ChatGlmModel
glbpath = "/DaTa/dl/chatglm-6b"
csc_lorapath = "/DaTa/dl/chatglm-6b-csc-zh-lora"

# 1) Base model only, no LoRA (the output of this case is quoted above)
model = ChatGlmModel("chatglm", glbpath)
r = model.predict(["对下面中文拼写纠错：\n少先队员因该为老人让坐。\n答："])
print(r)

# 2) Base model + LoRA
model = ChatGlmModel("chatglm", glbpath, lora_name=csc_lorapath)
r = model.predict(["对下面中文拼写纠错：\n少先队员因该为老人让坐。\n答："])
print(r)  # expected: ['少先队员应该为老人让座。\n错误字：因，坐']

# 3) Base model + LoRA, quantized to 8 bit with fp16 enabled
model = ChatGlmModel("chatglm", glbpath, lora_name=csc_lorapath, args={"quantization_bit": 8, "fp16": True})
r = model.predict(["对下面中文拼写纠错：\n少先队员因该为老人让坐。\n答："])
print(r)  # expected: ['少先队员应该为老人让座。\n错误字：因，坐']

Apr 11 '23 03:04 bash99

The eos token mismatch has been fixed: https://huggingface.co/THUDM/chatglm-6b/commit/aa51e62ddc9c9f334858b0af44cf59b05c70148a

Apr 11 '23 08:04 shibing624
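
A mismatch of this kind can be spotted by comparing the `eos_token_id` stored in a model's config.json with the id the tokenizer reports. The sketch below works on plain values so it runs without downloading anything; the function name is hypothetical, and the ids in the usage lines are made up for illustration, not taken from chatglm-6b.

```python
import json


def eos_mismatch(config_json_text, tokenizer_eos_id):
    """Return (config_id, tokenizer_id) if the two ids differ, else None.

    `config_json_text` is the text of a config.json file; a missing
    "eos_token_id" key is reported as None.
    """
    config_id = json.loads(config_json_text).get("eos_token_id")
    if config_id != tokenizer_eos_id:
        return (config_id, tokenizer_eos_id)
    return None


# Made-up ids for illustration only.
print(eos_mismatch('{"eos_token_id": 5}', 5))
print(eos_mismatch('{"eos_token_id": 5}', 7))
```

In practice the second argument would come from a loaded tokenizer's `eos_token_id` attribute, and the first from the config.json next to the model weights.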
