smallmocha

4 comments by smallmocha

Same issue here, has anyone fixed it yet?

@boydfd This issue does not seem to be fixed. It is not at model load time; I get an OOM after running for several days.

It seems to be due to CUDA graphs; there is no memory leak when enforce_eager=True.
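A minimal sketch of the workaround described here, assuming the standard vLLM offline-inference API; the model name and tensor_parallel_size are carried over from the comment below, not stated in this one:

```python
from vllm import LLM

# enforce_eager=True disables CUDA graph capture so vLLM runs in eager
# mode, which avoided the leak according to the comment above.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed, taken from the next comment
    tensor_parallel_size=2,
    enforce_eager=True)
```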

You should load the model outside the function so that it is loaded only once:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt")

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0, ...
```
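The snippet above is truncated in the original comment. A completed sketch, assuming the standard vLLM offline-inference API (llm.generate); the max_tokens value and the return handling are assumptions, not part of the original comment:

```python
from vllm import LLM, SamplingParams

# Constructed once at module scope, so repeated calls to
# process_prompts reuse the same loaded model.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt")

def process_prompts(prompts):
    # max_tokens=256 is an assumed value for illustration.
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(prompts, sampling_params)
    # Return the first completion's text for each prompt.
    return [output.outputs[0].text for output in outputs]
```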