
The result of direct inference without vLLM is wrong; is it a problem with the model?

lizhongv opened this issue on Dec 18, 2023 · 3 comments

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, GenerationConfig
# from vllm import LLM, SamplingParams
import torch
device = torch.device(0)


def load_tokenizer_and_model():
  tokenizer = AutoTokenizer.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  config = AutoConfig.from_pretrained('/root/autodl-tmp/selfrag_llama2_7b')
  model = AutoModelForCausalLM.from_pretrained(
    '/root/autodl-tmp/selfrag_llama2_7b',
    torch_dtype=torch.float16,
    config=config
  )

  model.to(device)
  model.eval()
  return tokenizer, model

def format_prompt(input, paragraph=None):
  prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
  if paragraph is not None:
    prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
  return prompt

if __name__ == "__main__":
  query_1 = "Leave odd one out: twitter, instagram, whatsapp."
  query_2 = "Can you tell me the difference between llamas and alpacas?"
  queries = [query_1, query_2]
  tokenizer, model = load_tokenizer_and_model()

  for q in queries:
    # inputs = tokenizer([format_prompt(query) for query in queries], return_tensors='pt')
    inputs = tokenizer(format_prompt(q), return_tensors='pt')
    input_ids = inputs['input_ids'].to(device)

    # Note: the original snippet passed `max_tokens=100`, which is a vLLM SamplingParams
    # argument; transformers' GenerationConfig does not use it, so generation fell back to
    # the default max_length of 20. That is why the outputs below are cut off after a few
    # tokens. The transformers equivalent is `max_new_tokens`.
    generation_config = GenerationConfig(
      do_sample=False,     # greedy decoding, same effect as vLLM's temperature=0.0
      top_p=1.0,
      max_new_tokens=100
    )
    with torch.no_grad():
      generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        repetition_penalty=1.2,
      )
    output = generation_output.sequences[0]
    # Self-RAG's reflection tokens (e.g. [Retrieval], [Relevant]) are added as special tokens,
    # so skip_special_tokens=True strips them; the repo's vLLM example decodes with
    # skip_special_tokens=False to keep them visible.
    output = tokenizer.decode(output, skip_special_tokens=True)
    print(output)

"""
'### Instruction:
Leave odd one out: twitter, instagram, whatsapp.

### Response:
Tw'


'### Instruction:
Can you tell me the difference between llamas and alpacas?

### Response:
S'
"""

lizhongv · Dec 18 '23 02:12

Thank you for reporting! Did the model work okay with vLLM? If so, the issue might be coming from the libraries. When we were working on earlier versions of Self-RAG back in June, I ran into multiple issues related to inconsistent predictions between vLLM and transformers (e.g., transformers batch decoding with Llama 2 had some issues, or vLLM predictions weren't exactly the same as transformers' when they should have matched). For those cases, it might be better to check the open issues in vllm or transformers.
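
For comparison, here is a minimal vLLM sketch along the lines of the README example (untested here; the local model path is copied from the script above, and arguments such as dtype may need adjusting for your setup):

from vllm import LLM, SamplingParams

# Load the same local checkpoint through vLLM (path taken from the script above).
model = LLM("/root/autodl-tmp/selfrag_llama2_7b", dtype="half")

# temperature=0.0 means greedy decoding in vLLM; `max_tokens` is the vLLM name for the
# generation budget (the transformers equivalent is `max_new_tokens`).
sampling_params = SamplingParams(
  temperature=0.0, top_p=1.0, max_tokens=100, skip_special_tokens=False
)

def format_prompt(input, paragraph=None):
  prompt = "### Instruction:\n{0}\n\n### Response:\n".format(input)
  if paragraph is not None:
    prompt += "[Retrieval]<paragraph>{0}</paragraph>".format(paragraph)
  return prompt

queries = [
  "Leave odd one out: twitter, instagram, whatsapp.",
  "Can you tell me the difference between llamas and alpacas?",
]
preds = model.generate([format_prompt(q) for q in queries], sampling_params)
for pred in preds:
  print("Model prediction:", pred.outputs[0].text)

If the vLLM outputs look complete while the transformers run stops after a handful of tokens, the difference most likely comes from the generation arguments rather than the model weights.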

AkariAsai · Dec 21 '23 04:12


I had the same problem, and it was very strange. Is there a solution?

wanghao-007 · Jan 02 '24 09:01