fastllm: Baichuan model conversion problem

ValueError: Can't find 'adapter_config.json' at 'hiyouga/baichuan-7b-sft'

Jul 08 '23 13:07 liaoweiguo

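Please post as much as you can about your operating system, system environment, compiler version, and the conversion steps you ran.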

Jul 08 '23 16:07 wildkid1024

Ubuntu 20.04, Python 3.10.9, CUDA Version: 11.6, A100 * 2, gcc version 7.5.0. Conversion command: python3 tools/baichuan_peft2flm.py baichuan-fp32.flm

Jul 09 '23 01:07 liaoweiguo

Check which version of the SFT model you have, and whether the adapter files were actually downloaded.
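
The error at the top of the thread names adapter_config.json, so one quick way to act on this advice is to look for the adapter files in a local copy of the model. The following is a minimal sketch, not part of fastllm: the helper name `missing_adapter_files` is hypothetical, and the weight file names `adapter_model.bin` / `adapter_model.safetensors` are an assumption about what a PEFT adapter checkpoint usually ships.

```python
import os
import tempfile

CONFIG_NAME = "adapter_config.json"  # the file named in the ValueError above
# Assumption: usual PEFT adapter weight file names; adjust if yours differ.
WEIGHT_NAMES = ("adapter_model.bin", "adapter_model.safetensors")


def missing_adapter_files(model_dir):
    """Return a list describing adapter files that are absent from model_dir."""
    missing = []
    if not os.path.isfile(os.path.join(model_dir, CONFIG_NAME)):
        missing.append(CONFIG_NAME)
    if not any(os.path.isfile(os.path.join(model_dir, n)) for n in WEIGHT_NAMES):
        missing.append(" or ".join(WEIGHT_NAMES))
    return missing


# Demo on a throwaway directory that holds only the config file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, CONFIG_NAME), "w").close()
    demo_missing = missing_adapter_files(demo_dir)
    print(demo_missing)
```

An empty list means both the config and a weight file were found. Anything else points at an incomplete download or at a repository that is not an adapter checkpoint at all.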

Jul 09 '23 13:07 wildkid1024

@ztxz16 Is Baichuan no longer supportable because of its license?

Jul 10 '23 11:07 wildkid1024

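The code structure of this Baichuan SFT model seems to have changed, so it probably can't be converted right now.

I haven't followed it for a while, and I don't know which Baichuan SFT model currently performs best.

My plan is to wait for the official Chat model, then convert it and upload the result to huggingface.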

Jul 10 '23 12:07 ztxz16

@ztxz16 Baichuan 13B is out, and the official chat version has been released too: https://github.com/baichuan-inc/Baichuan-13B

Jul 11 '23 01:07 ray-008

Eagerly waiting for SFT! 😍 @ztxz16

Jul 17 '23 04:07 heavenkiller2018

Yes, the chat version should run now. The conversion step just needs a lot of memory.

Jul 18 '23 09:07 ztxz16

I have two A100s with 40G of VRAM each. Running python3 tools/baichuan2flm.py baichuan-fp32.flm to convert baichuan-13b-chat fails:

root@720f636f2838:/home/user/code/build# python3 tools/baichuan2flm.py baichuan-fp32.flm
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:11<00:00, 23.98s/it]
Traceback (most recent call last):
  File "/home/user/code/build/tools/baichuan2flm.py", line 17, in <module>
    torch2flm.tofile(exportPath, model, tokenizer);
  File "/home/user/code/build/tools/fastllm_pytools/torch2flm.py", line 78, in tofile
    cur = dict[key].numpy().astype(np.float32)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Following the error message, I changed line 78 of fastllm_pytools/torch2flm.py from cur = dict[key].numpy().astype(np.float32) to cur = dict[key].cpu().numpy().astype(np.float32). After that the conversion worked and produced a 50G baichuan-fp32.flm file.
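
The one-line change above follows a general rule: NumPy can only wrap tensors that live in host memory, so a CUDA tensor has to be copied with .cpu() first. Below is a minimal sketch of the same conversion, assuming PyTorch and NumPy are installed. The helper name `to_float32_array` and the toy `weights` dict are illustrative and are not fastllm code.

```python
import numpy as np
import torch


def to_float32_array(tensor):
    """Copy a tensor to host memory and return it as a float32 NumPy array."""
    # .detach() drops autograd tracking; .cpu() is a no-op for CPU tensors.
    return tensor.detach().cpu().numpy().astype(np.float32)


# Use the GPU when one is present so the .cpu() copy is actually exercised.
device = "cuda" if torch.cuda.is_available() else "cpu"
weights = {"layer.weight": torch.ones(2, 3, dtype=torch.float16, device=device)}

arrays = {name: to_float32_array(t) for name, t in weights.items()}
print(arrays["layer.weight"].dtype, arrays["layer.weight"].shape)
```

Because .cpu() returns the tensor unchanged when it is already on the CPU, the helper is safe to call regardless of where the model was loaded.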

But quantizing with ./quant -p baichuan-fp32.flm -o baichuan-int8.flm -b 8 then fails with another error:

root@720f636f2838:/home/user/code/build# ./quant -p baichuan-fp32.flm -o baichuan-int8.flm -b 8
Load (283 / 283) 
Warmup...
status = 7
1 1 128
Error: cublas error.
terminate called after throwing an instance of 'char const*'
Aborted (core dumped)

Here is the GPU status:

[root@nqy-prod-gpu-node-2 code]# nvidia-smi 
Wed Jul 19 17:48:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:00:0A.0 Off |                    0 |
| N/A   55C    P0    46W / 250W |  40249MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   46C    P0    36W / 250W |      3MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     59451      C   ./quant                         40246MiB |
+-----------------------------------------------------------------------------+

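I noticed that one GPU's memory is completely full. The model is 50G and a single card only has 40G. Is this caused by insufficient VRAM, or by something else?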

Jul 19 '23 09:07 ray-008

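Yes, that's it. fastllm can only use a single GPU for now. When I tried this earlier, I used from_hf to create the int4 or int8 model. Alternatively, you can build with -DUSE_CUDA=OFF so that quantization runs in host memory instead.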
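
A back-of-the-envelope estimate is consistent with this answer. The sketch below assumes roughly 13 billion parameters (taken from the model name, not measured) and counts weights only, ignoring activations and any quantization metadata.

```python
PARAMS = 13e9  # assumption: ~13 billion weights, from the "13B" in the model name

BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}
GPU_MEMORY_GIB = 40  # one A100 40G card, as in the nvidia-smi output


def weights_gib(dtype):
    """Approximate size of the weights alone, in GiB."""
    return PARAMS * BYTES_PER_WEIGHT[dtype] / 2**30


for dtype in BYTES_PER_WEIGHT:
    size = weights_gib(dtype)
    verdict = "fits" if size < GPU_MEMORY_GIB else "does not fit"
    print(f"{dtype}: {size:.1f} GiB, {verdict} on a {GPU_MEMORY_GIB} GiB card")
```

The fp32 figure comes out near 48 GiB, in line with the 50G file reported in this thread and larger than a single 40G card. That would explain why loading the fp32 model onto one GPU for quantization fails, while the int8 and int4 results are small enough to fit.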

Jul 19 '23 09:07 ztxz16

After building with -DUSE_CUDA=OFF, quantization succeeded and I got a baichuan-int4.flm file.

But inference fails with Segmentation fault (core dumped):

from fastllm_pytools import llm
model_path = '/home/user/code/build/cbaichuan-int4.flm'
model = llm.model(model_path)
for response in model.stream_response('你好'):
    print(response, flush=True, end="")
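
The thread does not say what caused the segmentation fault. One cheap thing to rule out before calling the native loader is a wrong path or a truncated model file. The following sketch uses only the standard library. The helper name `check_model_file` and its size threshold are hypothetical and are not part of fastllm_pytools.

```python
import os
import tempfile


def check_model_file(path, min_bytes=1):
    """Raise a readable Python error before the path reaches native code."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"model file not found: {path}")
    size = os.path.getsize(path)
    if size < min_bytes:
        raise ValueError(f"model file looks truncated: {path} is {size} bytes")
    return size


# Demo with a temporary stand-in file; replace demo_path with the real .flm path.
with tempfile.NamedTemporaryFile(suffix=".flm", delete=False) as tmp:
    tmp.write(b"\0" * 16)
demo_path = tmp.name
demo_size = check_model_file(demo_path)
print(demo_size)
os.remove(demo_path)
```

If the check passes and the crash persists, the file itself is at least present and non-empty, which narrows the problem to the model contents or the loader.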

Jul 19 '23 10:07 ray-008
