bug: Memory usage increases with each request
Describe the bug
I'm trying to create embeddings for some documents with LangChain and OpenLLM. With each request the GPU RAM consumption increases by a few hundred MB until OpenLLM crashes with an OOM error. On startup the openllm process allocates around 8 GiB of GPU RAM; the GPU has 16 GiB in total.
To reproduce
Server side:
openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf --quantize int8
Client side:
import openllm
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import FAISS

class OpenLLMEmbeddings(Embeddings):
    def __init__(self, client):
        self.client = client

    def embed_documents(self, texts):
        # one /v1/embeddings request per document
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str):
        return self.client.embed(text).embeddings[0]

openllm_client = openllm.client.HTTPClient('http://some.server.local:3000')
embeddings = OpenLLMEmbeddings(openllm_client)
pages = confluence_loader.load(...)  # Confluence loader set up elsewhere
index = FAISS.from_documents(pages, embeddings)  # <-- crash (server-side OOM)
Logs
2023-08-19T15:49:17+0200 [INFO] [api_server:llm-llama-service:1] 192.168.1.1:60436 (scheme=http,method=POST,path=/v1/embeddings,type=application/json,length=308) (status=200,type=application/json,length=88380) 193.232ms (trace=3816de33066c9c4d96354641ef27341d,span=4fd179cbfa7a02f1,sampled=1,service.name=llm-llama-service)
2023-08-19T15:49:17+0200 [ERROR] [api_server:llm-llama-service:1] Exception on /v1/embeddings [POST] (trace=9637ce8614f23dbcc72a7c833a59015d,span=4c3bfe98ef3cab97,sampled=1,service.name=llm-llama-service)
Traceback (most recent call last):
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
output = await api.func(*args)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/_service.py", line 37, in embeddings_v1
responses = (await embed_call.async_run(phrases))[0]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-llama-runner: [500] Internal Server Error
2023-08-19T15:49:17+0200 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/home/dude/openllm/venv/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 273, in _request_handler
payload = await infer(params)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
raise r
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
outputs = await self.callback(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 221, in infer_batch
batch_ret = await runner_method.async_run(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
return await anyio.to_thread.run_sync(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runnable.py", line 140, in method
return self.func(obj, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/_llm.py", line 1040, in embeddings
return [self.embeddings([prompt] if isinstance(prompt, str) else prompt)]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/models/llama/modeling_llama.py", line 21, in embeddings
data = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True).hidden_states[-1]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
outputs = self.model(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
layer_outputs = decoder_layer(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 346, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.74 GiB total capacity; 12.11 GiB already allocated; 830.69 MiB free; 14.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-19T15:49:17+0200 [INFO] [api_server:llm-llama-service:1] 192.168.1.1:60450 (scheme=http,method=POST,path=/v1/embeddings,type=application/json,length=17826) (status=500,type=application/json,length=2) 571.715ms (trace=9637ce8614f23dbcc72a7c833a59015d,span=4c3bfe98ef3cab97,sampled=1,service.name=llm-llama-service)
Environment
transformers: 4.31.0
openllm: 0.2.26
Python: 3.8.10
CuDNN: 11
System information (Optional)
No response
Thank you @H-Simpson123. I had the same issue but did not get around to writing it up. +1
I'm having a similar issue without using embeddings.
openllm: 0.2.26 python: 3.11 CUDA toolkit: 12.2.0 model: tiiuae/falcon-7b
AFAIK 16 GB of RAM should be enough to load the model. Can you try int8?
> FAISS.from_documents...

Are you using the faiss library here?
The FAISS code is called from the client, and the openllm server is running on a different machine. The OOM crash happens on the server side.
> AFAIK 16 GB of RAM should be enough to load the model. Can you try int8?

This is with int8, please check my OP. The problem is not the model's initial memory usage, but that memory consumption grows with each request.
Hmm, that seems weird, since the LLM under the hood just calls transformers' generate. I will investigate, thanks for reporting.
Same here. RAM is not freed on the GPU between completions. Using SOLAR-10.7B-Instruct-v1.0-AWQ on a 24 GB RTX 4090. It starts off at 20444MiB / 24564MiB, but within 4 to 10 prompts, always, the RAM is full and I start getting CUDA out of memory errors.
This can be tested simply as follows:
- Start up openllm.
- Use nvidia-smi to monitor the GPU usage.
- Use "openllm query" to send about 10 or 20 requests to the backend (a rough sketch of this loop is below). What chews up the memory is the token parsing on input, not the output; the output doesn't use up the RAM.
I've now tested Ollama and it doesn't have this issue; the memory consumption remains unchanged between calls. Testing with the llamacpp server extension core dumps, so that's even worse.
This is probably just a cache clear call that's missing between calls.
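For illustration, the kind of per-request cleanup I mean would look roughly like the sketch below. This is a hypothetical helper, not the actual openllm code; it just mirrors the forward pass from the traceback above (output_hidden_states=True, last hidden state) and releases GPU memory before returning.

import torch

def embed_once(model, tokenizer, text):
    # Hypothetical helper mirroring the traceback's forward pass, not openllm's real code.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.inference_mode():  # don't keep an autograd graph for inference
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
        embedding = hidden.mean(dim=1).float().cpu()  # move the result off the GPU
    del inputs, hidden            # drop references to the GPU tensors
    torch.cuda.empty_cache()      # release cached blocks so usage doesn't pile up across requests
    return embedding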
Closing for openllm 0.6.