smallmocha
Same issue here, has anyone fixed it yet?
@boydfd It seems this did not fix the issue. It's not during model loading; I get an OOM after running for several days.
Seems to be due to CUDA graphs; there is no memory leak when enforce_eager=True.
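A minimal sketch of that workaround, assuming the Python API (`enforce_eager=True` on the `LLM` constructor; the CLI equivalent is `--enforce-eager`):

```python
from vllm import LLM

# enforce_eager=True skips CUDA graph capture, so the extra memory that
# captured graphs hold onto is never allocated (at some throughput cost)
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # model name reused from the snippet below
    enforce_eager=True,
)
```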
You should load the model outside the function so it is only loaded once:

```python
from vllm import LLM, SamplingParams

# load once at module level instead of on every call
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt",
)

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0)  # add any other sampling args you need
    return llm.generate(prompts, sampling_params)
```
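A hypothetical usage sketch on top of that, showing the model being reused across calls:

```python
# both calls reuse the module-level llm; the weights are never reloaded
outputs = process_prompts(["Hello, how are you?"])
outputs += process_prompts(["Tell me a joke."])
for out in outputs:
    print(out.outputs[0].text)  # first completion for each prompt
```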