Inference problem after fine-tuning Qwen-7B

Open aiquanpeng opened this issue 1 year ago • 14 comments

微信图片_20230807154312

aiquanpeng avatar Aug 07 '23 07:08 aiquanpeng

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:151643 for open-end generation. A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.

aiquanpeng avatar Aug 07 '23 07:08 aiquanpeng
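
For reference, the warning quoted above asks for three things: an `attention_mask`, a `pad_token_id`, and `padding_side='left'`. The following is a minimal sketch in plain Python of what left padding with an attention mask produces. The helper name `left_pad` and the token ids are hypothetical, and this is not the transformers implementation.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id sequences to a common length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Padding goes on the left so the real tokens end at the last position,
        # which is where a decoder-only model continues generating from.
        input_ids.append([pad_id] * n_pad + seq)
        # 0 marks padding (ignored by attention), 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# 151643 is the pad/eos id reported in the warning; the other ids are made up.
ids, mask = left_pad([[11, 12, 13], [21]], pad_id=151643)
print(ids)   # [[11, 12, 13], [151643, 151643, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With transformers itself, the warning's own suggestion applies: pass `padding_side='left'` when creating the tokenizer and forward the tokenizer's `attention_mask` to `generate`.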

I'm running into the same problem. Has it been solved?

suwief avatar Aug 08 '23 04:08 suwief

The output is completely garbled and none of the answers are correct. Waiting for a solution.

FeiWard avatar Aug 08 '23 09:08 FeiWard

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:151643 for open-end generation. A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.

You can ignore this warning; it should not affect generation. The model I trained with this project seems to generate reasonable results. image

yangjianxin1 avatar Aug 08 '23 12:08 yangjianxin1

same warning: The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:151643 for open-end generation. A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.

Modas-Li avatar Aug 09 '23 01:08 Modas-Li

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results. Setting pad_token_id to eos_token_id:151643 for open-end generation. A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer.

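You can ignore this warning; it should not affect generation. The model I trained with this project seems to generate reasonable results. image

cb78bebd9995fc60391994031ff2f9f This is my result; it is garbled.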

suwief avatar Aug 09 '23 07:08 suwief

This is caused by too many training epochs on too little data. Reduce the number of epochs.

aiquanpeng avatar Aug 09 '23 07:08 aiquanpeng

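This is caused by too many training epochs on too little data. Reduce the number of epochs.

I set the number of training epochs to 1. The dataset is indeed very small (just the samples from the examples). Can that make the model's output garbled?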

suwief avatar Aug 09 '23 08:08 suwief

I ran into the same problem. Loading the original Qwen-7b-chat model with the Firefly/script/chat/single_chat.py script also produces garbled output. image

qiuwenbogdut avatar Aug 22 '23 07:08 qiuwenbogdut

You can try the code below; it worked for me.


```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    model_name = 'Qwen-7B-Chat'
    # model_name = 'YeungNLP/firefly-baichuan-7b'
    # model_name = 'YeungNLP/firefly-ziya-13b'
    # model_name = 'YeungNLP/firefly-bloom-7b1'
    # model_name = 'YeungNLP/firefly-baichuan-13b'
    # model_name = 'YeungNLP/firefly-llama-30b'

    max_new_tokens = 500
    top_p = 0.9
    temperature = 0.35
    repetition_penalty = 1.0
    device = 'cuda'
    # device_map='auto' already places the weights on the available device(s),
    # so the model is not moved again with .to(device).
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map='auto'
    ).eval()
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True,
        # The llama tokenizer does not support the fast implementation
        use_fast=model.config.model_type != 'llama'
    )
    # QWenTokenizer is a special case: pad_token_id, bos_token_id and
    # eos_token_id are all None. eod_id is the id of the <|endoftext|> token.
    if tokenizer.__class__.__name__ == 'QWenTokenizer':
        tokenizer.pad_token_id = tokenizer.eod_id
        tokenizer.bos_token_id = tokenizer.eod_id
        tokenizer.eos_token_id = tokenizer.eod_id

    text = input('User:')
    while True:
        text = text.strip()
        # chatglm uses its official prompt format
        if model.config.model_type == 'chatglm':
            text = '[Round 1]\n\n问:{}\n\n答:'.format(text)
            input_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
        # For compatibility with qwen-7b: tokenizing its eos_token does not
        # yield the corresponding eos_token_id, so the prompt is assembled
        # directly from token ids.
        else:
            im_start_tokens = [tokenizer.im_start_id]
            im_end_tokens = [tokenizer.im_end_id]
            nl_tokens = tokenizer.encode("\n")

            def _tokenize_str(role, content):
                # Returns the raw text and the token ids of "<role>\n<content>"
                return f"{role}\n{content}", tokenizer.encode(
                    role, allowed_special=set()
                ) + nl_tokens + tokenizer.encode(content, allowed_special=set())

            # Empty system message
            system_text, system_tokens_part = _tokenize_str("system", "")
            system_tokens = im_start_tokens + system_tokens_part + im_end_tokens

            # System turn, then the user turn, then the opening of the assistant turn
            context_tokens = list(system_tokens)
            context_tokens += (
                nl_tokens
                + im_start_tokens
                + _tokenize_str("user", text)[1]
                + im_end_tokens
                + nl_tokens
                + im_start_tokens
                + tokenizer.encode("assistant")
                + nl_tokens
            )
            input_ids = torch.tensor([context_tokens]).to(device)

        with torch.no_grad():
            outputs = model.generate(
                input_ids=input_ids, max_new_tokens=max_new_tokens, do_sample=True,
                top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty,
                eos_token_id=tokenizer.eos_token_id
            )
        # Keep only the newly generated tokens (drop the prompt)
        outputs = outputs.tolist()[0][len(input_ids[0]):]
        response = tokenizer.decode(outputs)
        response = response.strip().replace(tokenizer.eos_token, "").strip()
        print("Firefly:{}".format(response))
        text = input('User:')


if __name__ == '__main__':
    main()
```

qiuwenbogdut avatar Aug 22 '23 08:08 qiuwenbogdut

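I ran into the same problem. Loading the original Qwen-7b-chat model with the Firefly/script/chat/single_chat.py script also produces garbled output. image

Every project or model has its own prompt (data organization) format, so you need to use that model's official inference script. The inference code in Firefly only supports models trained with the Firefly project.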

yangjianxin1 avatar Aug 22 '23 08:08 yangjianxin1
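
To make the point about per-model formats concrete, here is a small sketch that renders the same user message in the two formats that appear in the script posted earlier in this thread. The function name `build_prompt` is hypothetical. The `qwen` model type string and the literal `<|im_start|>` / `<|im_end|>` token text are assumptions, taken to correspond to `tokenizer.im_start_id` and `tokenizer.im_end_id` in that script.

```python
def build_prompt(model_type: str, text: str) -> str:
    """Render one user turn in the prompt format the given model family expects."""
    text = text.strip()
    if model_type == "chatglm":
        # Round-based format used for chatglm in the script earlier in this thread.
        return "[Round 1]\n\n问:{}\n\n答:".format(text)
    if model_type == "qwen":
        # ChatML-style layout that the same script assembles token by token:
        # empty system turn, user turn, then the opening of the assistant turn.
        return (
            "<|im_start|>system\n<|im_end|>\n"
            "<|im_start|>user\n{}<|im_end|>\n"
            "<|im_start|>assistant\n"
        ).format(text)
    # Failing loudly is safer than silently reusing another model's format.
    raise ValueError("no prompt template registered for model type: " + model_type)


print(repr(build_prompt("chatglm", "hello")))
print(repr(build_prompt("qwen", "hello")))
```

Feeding a model a prompt rendered in another family's format is the kind of mismatch described above, and it tends to show up as garbled output rather than as an error.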

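I ran into the same problem. Loading the original Qwen-7b-chat model with the Firefly/script/chat/single_chat.py script also produces garbled output. image

Every project or model has its own prompt (data organization) format, so you need to use that model's official inference script. The inference code in Firefly only supports models trained with the Firefly project.

Thanks for the explanation.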

qiuwenbogdut avatar Aug 22 '23 08:08 qiuwenbogdut

It still doesn't work. Is this catastrophic forgetting?

JOY-SWang avatar Jan 27 '24 09:01 JOY-SWang

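It still doesn't work. Is this catastrophic forgetting?

No. With catastrophic forgetting the model can no longer answer questions from its source-domain knowledge, and everything it outputs looks like answers from the SFT corpus; it does not produce random nonsense. The problem above looks more like the special tokens (eos, pos) being handled incorrectly at inference time, the same cause as the mismatched prompt format in the inference code mentioned earlier. You need to debug it yourself or use the original project's inference code.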

Vincent-ch99 avatar Mar 01 '24 06:03 Vincent-ch99
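
One way to start that debugging is to look at the raw generated ids and check what follows the first stop token. The sketch below is plain Python with a hypothetical helper `trim_at_stop`; the id 151643 is the `<|endoftext|>` id from the warning in this thread, while 151645 is only a stand-in for an end-of-turn token such as `<|im_end|>`.

```python
from typing import Iterable, List


def trim_at_stop(generated_ids: List[int], stop_ids: Iterable[int]) -> List[int]:
    """Return the generated ids up to, but not including, the first stop token."""
    stop = set(stop_ids)
    for i, token_id in enumerate(generated_ids):
        if token_id in stop:
            return generated_ids[:i]
    # No stop token was produced; keep everything.
    return generated_ids


# Made-up generation: three content tokens, an end-of-turn token, then tokens
# the model kept producing because generation did not stop there.
raw = [40, 41, 42, 151645, 99, 98, 151643]
print(trim_at_stop(raw, stop_ids=[151643, 151645]))  # [40, 41, 42]
```

If the decoded text only looks wrong after the first stop token, the stop condition passed to generation is a likely suspect. If it is wrong from the first token, the prompt format is the more likely cause.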