[Bug]: tensor parallel (4 cards) gives bad answers in version 0.5.1 and later (compared to 0.4.1) with GPTQ Marlin kernels (compared to standard GPTQ)
Your current environment
SageMaker ml.g5.12xlarge instance (4 × A10G, 24 GB each); the container is 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 from https://github.com/aws/deep-learning-containers/blob/master/available_images.md
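For reproducing the version comparison (0.4.1 vs 0.4.2/0.5.1), a minimal sanity check of the installed vLLM/CUDA versions and visible GPUs could look like the sketch below; it is only an environment dump, nothing here is specific to the bug:

```python
# Quick environment check: records the vLLM version being tested
# and confirms that all four A10G GPUs are visible.
import torch
import vllm

print("vllm:", vllm.__version__)
print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```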
🐛 Describe the bug
from vllm import LLM, SamplingParams

question = "what is the id of the team and what is the subtitute lineup of the home team for the match?"
history = str(["how many games the home team Sevilla won?"])
full_example = f""" \n \n\nYou are a transformation helper specialist, based on the history helping in transforming user input\nto a more structured and simpler text to a smaller model, which is less smart as you. \n\nMost of the times, the history could help you about entities which are now missing \nfrom the question\nTo illustrate the mission, if the user asked in the history about an entity (like 'Barcelona'), and \nnow he asked about 'team' (could be team, player, or other entity) or it seems to you that the an entity \nis missing in the context, perhaps the entity ('Barcelona') from the history could be the option to fill the gap. \n\nIf there is no entity in the history, please do not hallucinate and offer weird entity, for example if in the history\nyou saw 'home team' and now he just mentioned 'team', replace 'team' with 'home team' (applicable for away team too).\n\nWhen a replacement is occurred, please do not add 'the' as part of the entity, just entity itself.\n\n \nReturn it in valid JSON format according to the schema\n \n\n User question: \n\n {question} \n\n History: \n\n {history}"""

prompts = [full_example]
# Note: SamplingParams takes max_tokens, not max_new_tokens.
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

# quantization is intentionally not set so that vLLM picks the Marlin kernels
# instead of standard GPTQ (when available).
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    tokenizer_mode="auto",
    gpu_memory_utilization=0.7,
    guided_decoding_backend="lm-format-enforcer",
    tensor_parallel_size=4,
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The output for version 0.4.1 is correct: "what is the id of Sevilla and what is the substitute lineup of Sevilla for the match?"
For all later versions (0.4.2 and above) it returns garbage answers that do not make sense, e.g. "1900 1900 1900 1900" and other weird tokens.
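A possible A/B check for whether the Marlin path is responsible is to force the standard (non-Marlin) GPTQ kernels explicitly and regenerate with the same prompt; this is only a sketch, mirroring the parameters of the repro above:

```python
# Sketch: explicitly request the standard GPTQ kernels instead of letting
# vLLM auto-select GPTQ-Marlin, then compare the generated text.
from vllm import LLM

llm_gptq = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    quantization="gptq",  # force the non-Marlin GPTQ kernels
    tokenizer_mode="auto",
    gpu_memory_utilization=0.7,
    guided_decoding_backend="lm-format-enforcer",
    tensor_parallel_size=4,
)
```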
Notes:
1. As a side check I ran everything the same, but with TechxGenus/Meta-Llama-3-8B-Instruct-AWQ, and it worked. My suspicion is that the problem shows up when the model does not fit on a single GPU (24 GB is not enough for the 70B GPTQ model, unlike the 8B AWQ one) and tensor parallelism is therefore required; maybe something in the tensor-parallel code changed in v0.4.2 and later. A rough memory estimate is sketched after these notes.
2. As another note, I ran the same Llama 70B GPTQ on a single v6000 48 GB card and everything worked great, even with v0.5.1 (and of course 0.4.2 and up), so it is almost certainly something in the Megatron-style TP path.
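To make the memory argument concrete, here is a rough back-of-envelope estimate; the bytes-per-parameter figures are approximations (weights only, ignoring KV cache, activations, and quantization overhead such as scales/zeros), not measured numbers:

```python
# Rough weight-memory estimate for the two models mentioned above.
GIB = 1024**3

llama70b_gptq_4bit = 70e9 * 0.5 / GIB   # ~4 bits per weight -> ~33 GiB
llama8b_awq_4bit   = 8e9  * 0.5 / GIB   # ~4 GiB

print(f"70B GPTQ weights ~= {llama70b_gptq_4bit:.0f} GiB (> 24 GiB A10G, so TP is required)")
print(f"  per-GPU shard at TP=4 ~= {llama70b_gptq_4bit / 4:.0f} GiB")
print(f"8B AWQ weights  ~= {llama8b_awq_4bit:.0f} GiB (fits on a single A10G)")
print(f"70B GPTQ on a single 48 GiB card: "
      f"{'fits' if llama70b_gptq_4bit < 48 else 'does not fit'} (weights only)")
```

This is consistent with the observations: only the 70B GPTQ on the 4 × 24 GB setup is forced through the tensor-parallel path, and that is the only configuration that misbehaves.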
If any other data is needed, just comment and mention me. Thanks!