
[BUG] Batch inference fails after LoRA fine-tuning, what should I do?

Open TanXiang7o opened this issue 1 year ago • 4 comments

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

  • [X] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

  • [X] 我已经搜索过FAQ | I have searched FAQ

当前行为 | Current Behavior

Batch inference works with the original model on its own, but after merging with the LoRA adapter it fails, complaining that two positional arguments were received.

期望行为 | Expected Behavior

No response

复现方法 | Steps To Reproduce

No response

运行环境 | Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

备注 | Anything else?

No response

TanXiang7o avatar Nov 01 '23 18:11 TanXiang7o

Could you share the relevant code so we can reproduce this? Mainly the code that loads the merged model and the batch inference part.

jklj077 avatar Nov 02 '23 02:11 jklj077

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import GenerationConfig
import sys
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
from qwen_generation_utils import make_context, decode_tokens, get_stop_words_ids

tokenizer = AutoTokenizer.from_pretrained(
    'Qwen-7B-Chat',
    pad_token='<|extra_0|>',
    eos_token='<|endoftext|>',
    padding_side='left',
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Qwen-7B-Chat',
    pad_token_id=tokenizer.pad_token_id,
    device_map="auto",
    trust_remote_code=True
).eval()
peft_model_id = "lora-checkpoint-4000"
config = PeftConfig.from_pretrained(peft_model_id)
model = PeftModel.from_pretrained(model, peft_model_id)
model.generation_config = GenerationConfig.from_pretrained('/root/autodl-tmp/qwen/Qwen-7B-Chat', pad_token_id=tokenizer.pad_token_id)

all_raw_text = ["我想听你说爱我。", "今天我想吃点啥,甜甜的,推荐下", "我马上迟到了,怎么做才能不迟到"]
batch_raw_text = []
for q in all_raw_text:
    raw_text, _ = make_context(
        tokenizer,
        q,
        system="You are a helpful assistant.",
        max_window_size=model.generation_config.max_window_size,
        chat_format=model.generation_config.chat_format,
    )
    batch_raw_text.append(raw_text)

batch_input_ids = tokenizer(batch_raw_text, padding='longest')
batch_input_ids = torch.LongTensor(batch_input_ids['input_ids']).to(model.device)
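# NOTE: with the PeftModel wrapper, this positional call is what raises the
# "two positional arguments" TypeError reported above; in some peft versions
# PeftModel.generate only accepts keyword arguments, so passing
# input_ids=batch_input_ids instead avoids the error (see the workaround
# later in the thread).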
batch_out_ids = model.generate(
    batch_input_ids,
    return_dict_in_generate=False,
    generation_config=model.generation_config
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]

batch_response = [
    decode_tokens(
        batch_out_ids[i][padding_lens[i]:],
        tokenizer,
        raw_text_len=len(batch_raw_text[i]),
        context_length=(batch_input_ids[i].size(0)-padding_lens[i]),
        chat_format="chatml",
        verbose=False,
        errors='replace'
    ) for i in range(len(all_raw_text))
]
print(batch_response)

response, _ = model.chat(tokenizer, "我想听你说爱我。", history=None)
print(response)

response, _ = model.chat(tokenizer, "今天我想吃点啥,甜甜的,推荐下", history=None)
print(response)

response, _ = model.chat(tokenizer, "我马上迟到了,怎么做才能不迟到", history=None)
print(response)

In the official batch_infer example I only changed the model-loading code, additionally loading the LoRA adapter.
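A minimal sketch of one way around this, assuming a peft version that provides PeftModel.merge_and_unload(): fold the adapter weights into the base model first, so the resulting model is a plain Qwen model again and generate can be called exactly as in the official batch_infer example (paths and checkpoint names taken from the reproduction code above).

from peft import PeftModel
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'Qwen-7B-Chat',
    pad_token_id=tokenizer.pad_token_id,
    device_map="auto",
    trust_remote_code=True
).eval()
model = PeftModel.from_pretrained(model, "lora-checkpoint-4000")
# Merge the LoRA weights into the base model and drop the PeftModel wrapper,
# so model.generate(batch_input_ids, ...) accepts positional arguments again.
model = model.merge_and_unload()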


TanXiang7o avatar Nov 02 '23 03:11 TanXiang7o

{ "train_batch_size": "auto", "train_micro_batch_size_per_gpu" :"auto", "gradient_accumulation_steps": "auto", "gradient_clipping": 1.0, "bf16": { "enabled": "auto" }, "zero_optimization": { "stage": 3, "overlap_comm": true, "stage3_gather_16bit_weights_on_model_save": true }, "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } }

Don't enable CPU offload; just use torch's optimizer.
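For context, "CPU offload" here refers to the offload_optimizer / offload_param entries under zero_optimization, which the config above does not set. A minimal sketch of the keys to leave out, written as a Python dict as accepted by the HF Trainer's DeepSpeed integration (hypothetical snippet, not part of the original config):

ds_config_zero3_no_offload = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
        # Adding "offload_optimizer": {"device": "cpu"} or
        # "offload_param": {"device": "cpu"} here would enable the CPU offload
        # that the comment above advises against; without them the Trainer's
        # default torch AdamW optimizer is used.
    }
}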

ericzhou571 avatar Nov 11 '23 05:11 ericzhou571

@tx-cslearn Does this problem still occur with the latest code? If it does, please provide the error log.

jklj077 avatar Dec 12 '23 05:12 jklj077

In your case the workaround would be to name the argument: model.generate(input_ids=inputs.input_ids).
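Applied to the reproduction code above, only the generate call changes (a minimal sketch):

batch_out_ids = model.generate(
    input_ids=batch_input_ids,          # keyword argument instead of positional
    return_dict_in_generate=False,
    generation_config=model.generation_config
)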

livmortis avatar Apr 23 '24 07:04 livmortis