smallmocha
Same issue here, has anyone fixed it yet?
@boydfd It seems this did not fix the issue. It's not during model loading; I get an OOM after running for several days.
Seems to be due to CUDA graphs; there is no memory leak when enforce_eager=True.
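A minimal sketch of that workaround, assuming the Python API (`enforce_eager=True` on the `LLM` constructor; the CLI equivalent is `--enforce-eager`):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, so the extra memory that
# captured graphs hold onto is never allocated (at some throughput cost)
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # model name reused from the snippet below
    enforce_eager=True,
)
```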
You should load the model outside the function so it is only loaded once:

```python
from vllm import LLM, SamplingParams

# load once at module level instead of on every call
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt",
)

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0)  # add any other sampling args you need
    return llm.generate(prompts, sampling_params)
```
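A hypothetical usage sketch on top of that, showing the model being reused across calls:

```python
# both calls reuse the module-level llm; the weights are never reloaded
outputs = process_prompts(["Hello, how are you?"])
outputs += process_prompts(["Tell me a joke."])
for out in outputs:
    print(out.outputs[0].text)  # first completion for each prompt
```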