bug: Memory usage increases with each request
Describe the bug
I'm trying to create embeddings for some documents with LangChain and OpenLLM. With each request the GPU RAM consumption increases by a few hundred MB until OpenLLM crashes with an OOM error. On startup the openllm process allocates around 8 GiB of GPU RAM; the GPU has 16 GiB in total.
To reproduce
Server side:
openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf --quantize int8
Client side:
import openllm
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import FAISS

class OpenLLMEmbeddings(Embeddings):
    def __init__(self, client):
        self.client = client

    def embed_documents(self, texts):
        # one /v1/embeddings request per document
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str):
        return self.client.embed(text).embeddings[0]

openllm_client = openllm.client.HTTPClient('http://some.server.local:3000')
embeddings = OpenLLMEmbeddings(openllm_client)
pages = confluence_loader.load(...)  # Confluence loader set up elsewhere
index = FAISS.from_documents(pages, embeddings)  # <-- crash (server-side OOM)
Logs
2023-08-19T15:49:17+0200 [INFO] [api_server:llm-llama-service:1] 192.168.1.1:60436 (scheme=http,method=POST,path=/v1/embeddings,type=application/json,length=308) (status=200,type=application/json,length=88380) 193.232ms (trace=3816de33066c9c4d96354641ef27341d,span=4fd179cbfa7a02f1,sampled=1,service.name=llm-llama-service)
2023-08-19T15:49:17+0200 [ERROR] [api_server:llm-llama-service:1] Exception on /v1/embeddings [POST] (trace=9637ce8614f23dbcc72a7c833a59015d,span=4c3bfe98ef3cab97,sampled=1,service.name=llm-llama-service)
Traceback (most recent call last):
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
output = await api.func(*args)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/_service.py", line 37, in embeddings_v1
responses = (await embed_call.async_run(phrases))[0]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-llama-runner: [500] Internal Server Error
2023-08-19T15:49:17+0200 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/home/dude/openllm/venv/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 718, in __call__
await route.handle(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 273, in _request_handler
payload = await infer(params)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
raise r
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
outputs = await self.callback(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/server/runner_app.py", line 221, in infer_batch
batch_ret = await runner_method.async_run(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
return await anyio.to_thread.run_sync(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/dude/openllm/venv/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/bentoml/_internal/runner/runnable.py", line 140, in method
return self.func(obj, *args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/_llm.py", line 1040, in embeddings
return [self.embeddings([prompt] if isinstance(prompt, str) else prompt)]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/openllm/models/llama/modeling_llama.py", line 21, in embeddings
data = self.model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True).hidden_states[-1]
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
outputs = self.model(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 693, in forward
layer_outputs = decoder_layer(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/dude/openllm/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/dude/openllm/venv/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 346, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.74 GiB total capacity; 12.11 GiB already allocated; 830.69 MiB free; 14.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-19T15:49:17+0200 [INFO] [api_server:llm-llama-service:1] 192.168.1.1:60450 (scheme=http,method=POST,path=/v1/embeddings,type=application/json,length=17826) (status=500,type=application/json,length=2) 571.715ms (trace=9637ce8614f23dbcc72a7c833a59015d,span=4c3bfe98ef3cab97,sampled=1,service.name=llm-llama-service)
Environment
transformers: 4.31.0
openllm: 0.2.26
Python: 3.8.10
CuDNN: 11
System information (Optional)
No response
Thank you @H-Simpson123. I had the same issue but did not get around to writing it up. +1
I'm having a similar issue without using embeddings.
openllm: 0.2.26 python: 3.11 CUDA toolkit: 12.2.0 model: tiiuae/falcon-7b
AFAIK 16 GB of RAM should be enough to load the model. Can you try int8?
> FAISS.from_documents...

Are you using the faiss library here?
The FAISS code is called from the client, and the openllm server is running on a different machine. The OOM crash happens on the server side.
> AFAIK 16 GB of RAM should be enough to load the model. Can you try int8?

This is with int8, please check my OP. The problem is not the model's initial memory usage, but that memory consumption grows with each request.
Hmm, that seems weird, since the LLM under the hood just calls transformers' generate. I will investigate, thanks for reporting.
Same here. RAM is not freed on the GPU between completions. Using SOLAR-10.7B-Instruct-v1.0-AWQ on a 24 GB RTX 4090. It starts off at 20444MiB / 24564MiB, but within 4 to 10 prompts, always, the RAM is full and I start getting CUDA out of memory errors.
This can be tested simply as follows:
- Start up openllm.
- Use nvidia-smi to monitor the GPU usage.
- Use "openllm query" to send about 10 or 20 requests to the backend (a rough sketch of this loop is below). What chews up the memory is the token parsing on input, not the output; the output doesn't use up the RAM.
I've now tested Ollama and it doesn't have this issue; the memory consumption remains unchanged between calls. Testing with the llamacpp server extension core dumps, so that's even worse.
This is probably just a cache clear call that's missing between calls.
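For illustration, the kind of per-request cleanup I mean would look roughly like the sketch below. This is a hypothetical helper, not the actual openllm code; it just mirrors the forward pass from the traceback above (output_hidden_states=True, last hidden state) and releases GPU memory before returning.

import torch

def embed_once(model, tokenizer, text):
    # Hypothetical helper mirroring the traceback's forward pass, not openllm's real code.
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.inference_mode():  # don't keep an autograd graph for inference
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
        embedding = hidden.mean(dim=1).float().cpu()  # move the result off the GPU
    del inputs, hidden            # drop references to the GPU tensors
    torch.cuda.empty_cache()      # release cached blocks so usage doesn't pile up across requests
    return embedding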
Closing for openllm 0.6.