
Is there a way to terminate vllm.LLM and release the GPU memory

Open sfc-gh-zhwang opened this issue 1 year ago • 25 comments

After running the code below, is there an API (maybe something like llm.terminate) to kill the llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
llm = LLM(model="facebook/opt-125m")  # illustrative model; any vLLM-supported model
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

sfc-gh-zhwang avatar Dec 04 '23 00:12 sfc-gh-zhwang

After running the code below, is there an API (maybe something like llm.terminate) to kill the llm and release the GPU memory?

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
llm = LLM(model="facebook/opt-125m")  # illustrative model; any vLLM-supported model
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

Please check the code below. It works.

import gc

import torch
from vllm import LLM, SamplingParams
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

# Load the model via vLLM (model_name, saver_dir, and num_gpus are your own values)
llm = LLM(model=model_name, download_dir=saver_dir, tensor_parallel_size=num_gpus, gpu_memory_utilization=0.70)

# Delete the llm object and free the memory
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()
torch.distributed.destroy_process_group()
print("Successfully delete the llm pipeline and free the GPU memory!")

Best regards,

Shuyue Dec. 3rd, 2023

SuperBruceJia avatar Dec 04 '23 00:12 SuperBruceJia

mark

hijkzzz avatar Dec 04 '23 01:12 hijkzzz

Even after executing the code above, the GPU memory is not freed with the latest vllm built from source. Any recommendations?

deepbrain avatar Feb 08 '24 22:02 deepbrain

Are there any updates on this? The above code does not work for me either.

huylenguyen avatar Feb 24 '24 23:02 huylenguyen

+1

puddingfjz avatar Mar 01 '24 15:03 puddingfjz

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker. Can anybody explain why this is the case?
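
For reference, here is a minimal sketch of that full single-worker cleanup, assuming an older vLLM (~0.3.x) where LLM exposes llm_engine.driver_worker and destroy_model_parallel still lives under vllm.model_executor.parallel_utils (newer versions moved it to vllm.distributed.parallel_state); the model name is only illustrative:

import gc

import torch
from vllm import LLM
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

llm = LLM(model="facebook/opt-125m")  # illustrative model
# ... run llm.generate(...) as usual ...

# Tear down the parallel state, drop the in-process worker that owns the weights, then free caches.
destroy_model_parallel()
del llm.llm_engine.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()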

puddingfjz avatar Mar 01 '24 16:03 puddingfjz

+1

shyringo avatar Apr 23 '24 09:04 shyringo

I find that we need to explicitly run "del llm.llm_engine.driver_worker" to release it when using a single worker. Can anybody explain why this is the case?

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.


But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down ray so that vllm.LLM() can be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Anyway, even though it works, it is just a temporary solution and this issue still needs fixing.

shyringo avatar Apr 24 '24 07:04 shyringo

I tried the above code block and also this line "del llm.llm_engine.driver_worker". Both failed for me.

But I managed, with the following code, to terminate the vllm.LLM(), release the GPU memory, and shut down ray so that vllm.LLM() can be used for the next model. After this, I succeeded in using vllm.LLM() again for the next model.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Anyway, even though it works, it is just a temporary solution and this issue still needs fixing.

update: the following code would work better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

shyringo avatar Apr 24 '24 09:04 shyringo

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

ticoneva avatar Apr 25 '24 10:04 ticoneva

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

thx a lot

rbao2018 avatar May 04 '24 12:05 rbao2018

vLLM seems to hang on to the first allocated LLM() instance. It does not hang on to later instances. Maybe that helps with diagnosing the issue?

from vllm import LLM


def show_memory_usage():
    import torch.cuda
    import torch.distributed
    import gc

    print(f"cuda memory: {torch.cuda.memory_allocated()//1024//1024}MB")
    gc.collect()
    # torch.distributed.destroy_process_group()
    torch.cuda.empty_cache()
    print(f"  --> after gc: {torch.cuda.memory_allocated()//1024//1024}MB")


def gc_problem():
    show_memory_usage()
    print("loading llm0")
    llm0 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=180)
    del llm0
    show_memory_usage()

    print("loading llm1")
    llm1 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=500)
    del llm1
    show_memory_usage()

    print("loading llm2")
    llm2 = LLM(model="facebook/opt-125m", num_gpu_blocks_override=600)
    del llm2
    show_memory_usage()

gc_problem()
root@c09a058c2d5b:/workspaces/aici/py/vllm# python tests/core/block/e2e/gc_problem.py |grep -v INFO
cuda memory: 0MB
  --> after gc: 0MB
loading llm0
cuda memory: 368MB
  --> after gc: 368MB
loading llm1
cuda memory: 912MB
  --> after gc: 368MB
loading llm2
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
cuda memory: 961MB
  --> after gc: 368MB
root@c09a058c2d5b:/workspaces/aici/py/vllm# 

llm1 consumes more memory than llm0, but you can see that the allocated memory stays at the llm0 level.
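
If it helps with diagnosing, one rough way to see what is still keeping the first instance alive is to walk the garbage collector's live objects and print their referrers; the class names below (Worker, RayGPUExecutor, GPUExecutor, LLMEngine) are only illustrative and depend on the vLLM version:

import gc

# Print vLLM-related objects that are still alive and a sample of what refers to them.
suspects = ("Worker", "RayGPUExecutor", "GPUExecutor", "LLMEngine")
for obj in gc.get_objects():
    if type(obj).__name__ in suspects:
        referrers = [type(r).__name__ for r in gc.get_referrers(obj)]
        print(f"{type(obj).__name__} is still alive; referred to by {referrers[:5]}")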

mmoskal avatar May 08 '24 18:05 mmoskal

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

yudataguy avatar May 09 '24 12:05 yudataguy

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

You could try the "del llm.llm_engine.model_executor" in the following code instead:

update: the following code would work better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

shyringo avatar May 09 '24 12:05 shyringo

Tried this, including ray.shutdown(), but the memory is not released on my end. Any other suggestions?

You could try the "del llm.llm_engine.model_executor" in the following code instead:

update: the following code would work better, without the possible deadlock warning.

        #llm is a vllm.LLM object
        import gc
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel
        import os

        #avoid huggingface/tokenizers process dead lock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        destroy_model_parallel()
        #del a vllm.executor.ray_gpu_executor.RayGPUExecutor object
        del llm.llm_engine.model_executor
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        import ray
        ray.shutdown()

Did that as well; still no change in GPU memory allocation. Not sure how to proceed.

yudataguy avatar May 11 '24 01:05 yudataguy

In the latest version of vLLM destroy_model_parallel has moved to vllm.distributed.parallel_state. The objects you have to delete have also changed:

from vllm.distributed.parallel_state import destroy_model_parallel
...
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm # Isn't necessary for releasing memory, but why not
gc.collect()
torch.cuda.empty_cache()

We tried this in version 0.4.2, but GPU memory was not released.

zheyang0825 avatar May 11 '24 02:05 zheyang0825

Did that as well; still no change in GPU memory allocation. Not sure how to proceed.

Then I do not have a clue either. Meanwhile, I should add one piece of information: the vllm version with which the above code succeeded for me was 0.4.0.post1.

shyringo avatar May 11 '24 06:05 shyringo

@zheyang0825 does adding this line at the end make it work?

torch.distributed.destroy_process_group()         
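
Putting the pieces from this thread together, the whole teardown would then look roughly like this (a sketch assuming vLLM ~0.4.x, where destroy_model_parallel lives in vllm.distributed.parallel_state; attribute names may differ between releases):

import gc
import os

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# llm is an existing vllm.LLM object
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # avoid the huggingface/tokenizers fork warning

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker  # or: del llm.llm_engine.model_executor
del llm
gc.collect()
torch.cuda.empty_cache()
if torch.distributed.is_initialized():
    torch.distributed.destroy_process_group()
# If tensor_parallel_size > 1 and ray is in use:
# import ray; ray.shutdown()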

mnoukhov avatar May 11 '24 16:05 mnoukhov

Did that as well; still no change in GPU memory allocation. Not sure how to proceed.

Then I do not have a clue either. Meanwhile, I should add one piece of information: the vllm version with which the above code succeeded for me was 0.4.0.post1.

Tried on 0.4.0.post1 and the method worked; not sure what changed in the latest version that's no longer releasing the memory. Possible bug?

yudataguy avatar May 12 '24 01:05 yudataguy

Hello! So if I'm not wrong, no one has managed to release memory on vllm 0.4.2 yet?

GurvanR avatar May 13 '24 14:05 GurvanR

A new bug was introduced in 0.4.2, but it was fixed in https://github.com/vllm-project/vllm/pull/4737. Please try with that PR, or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

njhill avatar May 13 '24 14:05 njhill

A new bug was introduced in 0.4.2, but it was fixed in #4737. Please try with that PR, or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I updated vllm yesterday and I still have the problem. I'm using these lines:

destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()

GurvanR avatar May 14 '24 09:05 GurvanR

This code worked for me

vllm==0.4.0.post1

        import gc
        import torch
        import ray
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        print('service stopping ..')
        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        destroy_model_parallel()

        del model.llm_engine.model_executor.driver_worker
        del model

        gc.collect()
        torch.cuda.empty_cache()
        ray.shutdown()

        print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")

        print("service stopped")

Misterrendal avatar May 14 '24 14:05 Misterrendal

There should be a built-in way! We cannot keep writing code that breaks on the next minor release :(

cassanof avatar May 16 '24 05:05 cassanof

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks.

I would say, the most stable way to terminate vLLM is to shut down the process.
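
For example, one way to follow this advice is to run each LLM in its own short-lived process, so that all of its GPU state is released when the process exits; here is a sketch using only the standard library (the model name and prompt are illustrative):

import multiprocessing as mp

from vllm import LLM, SamplingParams


def run_inference(model_path, prompts):
    # Everything vLLM allocates lives and dies inside this child process.
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
    for output in outputs:
        print(output.outputs[0].text)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn avoids inheriting CUDA state from the parent
    p = ctx.Process(target=run_inference, args=("facebook/opt-125m", ["The capital of France is"]))
    p.start()
    p.join()
    # Once the child exits, its GPU memory is returned to the system.

To get results back into the parent you would pass a multiprocessing.Queue (or write them to disk) instead of printing.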

youkaichao avatar May 16 '24 05:05 youkaichao

A new bug was introduced in 0.4.2, but fixed in #4737. Please try with that PR or as a workaround you can also install tensorizer.

This should resolve such errors at least for TP=1. For TP > 1, there may be other issues with creating a new LLM instance after deleting one in the same process.

I encountered this issue when TP = 8. I'm doing this in an iterative manner since I need to run the embedding model after the generative model, so there is some loading/offloading. The first iteration is fine, but in the second iteration the instantiation of the vllm ray server hangs.

Vincent-Li-9701 avatar May 20 '24 20:05 Vincent-Li-9701

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks.

I would say, the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

cassanof avatar May 21 '24 06:05 cassanof

I don't know if anyone can currently clear the memory correctly, but in version 0.4.2 the code above failed to clear the memory for me. I can only use the slightly extreme method of creating a new process for the call and closing it afterwards to roughly solve the problem:

import torch
from multiprocessing import Process, set_start_method
from vllm import LLM, SamplingParams

set_start_method('spawn', force=True)

def vllm_texts(model_path):
    # Runs entirely inside the child process, so its GPU memory is freed when the process exits.
    prompts = ""
    sampling_params = SamplingParams(max_tokens=512)
    llm = LLM(model=model_path)
    outputs = llm.generate(prompts, sampling_params)

...
print(torch.cuda.memory_summary())
p = Process(target=vllm_texts, args=(model_path,))  # args must be a tuple, hence the trailing comma
p.start()
p.join()
if p.is_alive():
    p.terminate()
p.close()
print(torch.cuda.memory_summary())
...

I still hope there will be a way in the future to correctly and completely clear the memory.

DuZKai avatar May 31 '24 06:05 DuZKai

When I am using multiple GPUs to serve an LLM (tensor_parallel_size > 1), the GPUs' memory is not released, except on the first GPU (cuda:0).

[screenshot attached]

SuperBruceJia avatar Jun 11 '24 01:06 SuperBruceJia

In general it is very difficult to clean up all resources correctly, especially when we use multiple GPUs, and it might be prone to deadlocks. I would say, the most stable way to terminate vLLM is to shut down the process.

I understand your point. However, this feature is extremely useful for situations where you need to switch between models. For instance, reinforcement learning loops. I am writing an off-policy RL loop, requiring me to train one model (target policy) while its previous version performs inference (behavior policy). As a result, I frequently load and unload models. While I know vLLM is not intended for training, using transformers would be too slow, making my technique unviable.

Let me know if this is a feature that's wanted and the team would be interested in maintaining it. I can open a separate issue and start working on it.

Glad to see you here @cassanof and to hear that you have been using vLLM in this kind of workflow!

Given how much this feature seems to be wanted, I will bring this back to the team to discuss! If multi-GPU instances are prone to deadlocks, then perhaps we can at least start with single-GPU instances. Everyone on the maintainer team has limited bandwidth and we have a lot of things to work on, so contributions are very welcome as always!

ywang96 avatar Jun 19 '24 06:06 ywang96