LongWriter
ValueError when running LongWriter-GLM4-9B with vLLM on Kaggle T4 GPUs
System Info / 系統信息
- Python version: 3.10
- Hardware: 2x NVIDIA T4 GPUs (Kaggle environment)
- CUDA: latest available on Kaggle
- vLLM: latest (installed via pip with `--force-reinstall`)
- Model: THUDM/LongWriter-glm4-9b
- Platform: Kaggle Notebooks
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- [ ] The official example scripts / 官方的示例脚本
- [X] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
The error occurs when running LongWriter-GLM4-9B with tensor parallelism across 2 T4 GPUs. Here is the minimal reproduction code:

```python
from vllm import LLM, SamplingParams

# Load the model sharded across both T4s.
model = LLM(
    model="THUDM/LongWriter-glm4-9b",
    dtype="half",
    trust_remote_code=True,
    tensor_parallel_size=2,
    max_model_len=32768,
    gpu_memory_utilization=1,
)

tokenizer = model.get_tokenizer()

# Stop on EOS or on the chat-role special tokens.
stop_token_ids = [
    tokenizer.eos_token_id,
    tokenizer.get_command("<|user|>"),
    tokenizer.get_command("<|observation|>"),
]

generation_params = SamplingParams(
    temperature=0.5,
    top_p=0.8,
    top_k=50,
    max_tokens=32768,
    repetition_penalty=1,
    stop_token_ids=stop_token_ids,
)

query = "Write a 10000-word China travel guide"
input_ids = tokenizer.build_chat_input(query, history=[], role="user").input_ids[0].tolist()

outputs = model.generate(
    sampling_params=generation_params,
    prompt_token_ids=[input_ids],
)
```
Error message:

```
ValueError: could not broadcast input array from shape (513,) into shape (512,)
```

The full traceback shows the error is raised in `vllm/attention/backends/utils.py`, line 215, while the attention metadata is being built.
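For context, the underlying NumPy failure can be reproduced in isolation; the shapes below (513 vs. 512) mirror the ones in the traceback:

```python
import numpy as np

# The same ValueError in isolation: NumPy refuses to copy a 513-element
# array into a 512-slot destination, matching the shapes in the traceback.
dst = np.zeros(512)
src = np.arange(513)
try:
    dst[:] = src
except ValueError as e:
    msg = str(e)
print(msg)  # could not broadcast input array from shape (513,) into shape (512,)
```

This suggests vLLM's attention-metadata builder is sizing a buffer one slot too small for the sequence it is about to write.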
Expected behavior / 期待表现
The model should:
- Successfully load and initialize across both T4 GPUs using tensor parallelism
- Accept the input prompt and generate text using the specified parameters
- Handle the attention mechanism correctly without shape mismatches
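One untested workaround (an assumption, not a confirmed fix): with `max_tokens=32768` equal to `max_model_len`, any non-empty prompt pushes the requested sequence past the context window, which may be what trips the off-by-one in the attention-metadata builder. Budgeting the output against the prompt length avoids that situation; the helper and the 600-token prompt length below are illustrative, not from the traceback:

```python
MAX_MODEL_LEN = 32768  # same context length as in the repro

def output_budget(prompt_len: int, max_model_len: int = MAX_MODEL_LEN) -> int:
    """Tokens left for generation once the prompt occupies part of the context."""
    return max(max_model_len - prompt_len, 0)

# e.g. a hypothetical 600-token chat prompt leaves this many output tokens:
print(output_budget(600))  # 32168
```

In the repro, this would mean passing `max_tokens=output_budget(len(input_ids))` to `SamplingParams`, and possibly backing `gpu_memory_utilization` off the hard `1` limit (e.g. `0.95`) to leave headroom on the 16 GB T4s.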