OpenLLM
bug: 40GB is not enough for llama-2-7b
### Describe the bug
Starting a server with an int4-quantized model on a 40GB GPU throws OOM errors. If I am not mistaken, the quantization should result in a ~3.5 GB model, and even the original model (~28 GB) should easily fit in memory.
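For reference, a minimal sketch of where the ~3.5 GB estimate comes from, loading the same checkpoint in 4-bit directly with `transformers` and `bitsandbytes` (outside of OpenLLM; assumes `accelerate` is installed and the gated meta-llama checkpoint is accessible, and the compute dtype below is an assumption for illustration):

```python
# Hedged sketch: load the checkpoint in 4-bit and print the weight footprint,
# to sanity-check the ~3.5 GB estimate. This is not the code path OpenLLM uses internally.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype is an assumption
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"weights on device: {model.get_memory_footprint() / 1e9:.2f} GB")
```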
### To reproduce
- `openllm start llama --model-id meta-llama/Llama-2-7b-hf --quantize int4`
- `curl -XPOST localhost:3000/v1/generate -H 'content-type: application/json' -d '{"prompt": "hello there, my name is "}'`
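The same request can also be sent from Python; a sketch assuming the `requests` package and that the server from the first step is listening on localhost:3000:

```python
# Equivalent of the curl reproduction step above (requests sets the JSON content-type itself).
import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "hello there, my name is "},
    timeout=300,
)
print(resp.status_code)
print(resp.text)
```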
### Logs
2023-08-28T22:20:12+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 291, in _request_handler
payload = await infer(params)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
raise r
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
outputs = await self.callback(
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 271, in infer_single
ret = await runner_method.async_run(*params.args, **params.kwargs)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 62, in async_run_method
return await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runnable.py", line 143, in method
return self.func(obj, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1168, in generate
return self.generate(prompt, **attrs)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 986, in generate
for it in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1024, in generate_iterator
out = self.model(torch.as_tensor([input_ids], device=self.device), use_cache=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 85.12 MiB is free. Process 17224 has 39.31 GiB memory in use. Of the allocated memory 36.93 GiB is allocated by PyTorch, and 1.79 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-28T22:20:12+0000 [ERROR] [api_server:llm-llama-service:6] Exception on /v1/generate [POST] (trace=64f1f2855c26af6f1d6212b60faeab5d,span=2f1792bb2f5b298e,sampled=1,service.name=llm-llama-service)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
output = await api.func(*args)
File "/usr/local/lib/python3.10/dist-packages/openllm/_service.py", line 36, in generate_v1
responses = await runner.generate.async_run(qa_inputs.prompt, **{'adapter_name': qa_inputs.adapter_name, **config})
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-llama-runner: [500] Internal Server Error
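As an aside, the OOM message itself suggests trying `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of applying that hint from a Python process is below; the 128 MiB value is an assumption, and with ~37 GiB already allocated by PyTorch it is unlikely to be the root fix. For the OpenLLM server the variable would have to be exported in the server process's environment instead.

```python
# Hedged sketch of the allocator hint from the OOM message: the env var must be set
# before CUDA is initialized, i.e. before the first CUDA allocation happens.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # value is an assumption

import torch  # imported after setting the env var so the caching allocator picks it up

print(torch.cuda.is_available())
```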
### Environment
"openllm[llama,gptq]==0.2.27"
protobuf==3.20.3
bitsandbytes==0.41.0
### System information (Optional)
_No response_
How many GPUs do you have?
Can you also send the whole stack trace from the server?
> How many GPUs do you have?

Only 1 A100 40GB.

> Can you also send the whole stack trace from the server?

Not complete, but here's a big chunk:
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 75.12 MiB is free. Process 17107 has 39.32 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 75.12 MiB is free. Process 17107 has 39.32 GiB memory in use. Of the allocated memory 37.65 GiB is allocated by PyTorch, and 1.09 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 85.12 MiB is free. Process 17107 has 39.31 GiB memory in use. Of the allocated memory 37.58 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:11] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 49.12 MiB is free. Process 17107 has 39.34 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:11] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 49.12 MiB is free. Process 17107 has 39.34 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:17+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
async for p in payload:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
async for data in ret:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
yield await anyio.to_thread.run_sync(_next, iterator)
File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
return next(iterator)
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
for outputs in self.generate_iterator(prompt, **attrs):
File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
hidden_states = self.mlp(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 27.12 MiB is free. Process 17107 has 39.36 GiB memory in use. Of the allocated memory 37.81 GiB is allocated by PyTorch, and 1003.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:17+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
await self.app(scope, otel_receive, otel_send)
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
await self.app(scope, receive, wrapped_send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
await response(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
async with anyio.create_task_group() as task_group:
File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
raise exceptions[0]
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
await func()
File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
async for chunk in self.body_iterator:
File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
async for b, end_of_http_chunk in resp.content.iter_chunks():
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
rv = await self._stream.readchunk()
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
await self._wait("readchunk")
File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:9] 172.17.0.1:45274 (scheme=http,method=GET,path=/api,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 1.177ms (trace=817f402b2514f1b74a15c125ce18be1c,span=8a3ddba07f51ac85,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45282 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.834ms (trace=96a7bb2df2d520cfb0cff1945333902a,span=7a66b344beb302ba,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45288 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.676ms (trace=c88b71f6facc1d03460e5c6ae338d21f,span=7733fb897e86375c,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45294 (scheme=http,method=GET,path=/api/kernels,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.635ms (trace=61245fa755cd129a1d50a9002175727b,span=a3a950a6859b0128,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45310 (scheme=http,method=GET,path=/api/sessions,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.632ms (trace=ca9ba2c6172f14c94b660da53acc0316,span=b1d6276b9658e3b2,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45324 (scheme=http,method=GET,path=/api/terminals,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.638ms (trace=0a39ef6cb78ebe43f9b89f6566c77d12,span=3bfd08a359b3b3de,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51128 (scheme=http,method=GET,path=/api,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.921ms (trace=7ec21411d314f77fab046d00a322f7d9,span=2c60c089b3634328,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51134 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.750ms (trace=4e258c0ff298f0c24124d2852e206d8b,span=387300af92476321,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51150 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.680ms (trace=e0be02595163242f028cfe8cc0d1a4a8,span=6b3435e2fd726248,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51152 (scheme=http,method=GET,path=/api/kernels,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.699ms (trace=49aea1821f92b418bb851228acee6d54,span=781f810f146045c9,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51154 (scheme=http,method=GET,path=/api/sessions,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.641ms (trace=946de737203a55f5fb6c2ad0593f85f1,span=eca34071b37ebbd5,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51158 (scheme=http,method=GET,path=/api/terminals,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.636ms (trace=6b05df432d0aa3d0644e27438fc9caf0,span=f035c19152fcf010,sampled=1,service.name=llm-llama-service)
So did the model start up correctly? I was able to run llama-2-7b with 1 T4.
Yes, it did start up correctly. The model seems to load, but it fails to serve once it receives an inference request.
I am facing the same issue on an Nvidia L4 (24 GB VRAM) using quantized versions of llama-2-7b-chat:
openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf --device 0 --quantize int8 --verbose
I tried two different GPUs and both quantized versions (int4 and int8) of llama-2-7b, but ran into the same OOM error.
While testing, a small prompt worked, with memory consumption of about 10-13 GB. A curl request with the same prompt (Content-Length: 997) also worked. But when I switched to a larger prompt (Content-Length: 2752), it fails with an OOM error.
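For reference, a rough back-of-the-envelope KV-cache estimate (a sketch using the published Llama-2-7B configuration; batch size 1 and an fp16 cache are assumed) suggests that prompt length alone should not come close to exhausting a 24-40 GB card:

# KV-cache size per token for Llama-2-7B: two tensors (K and V) per layer,
# each of width hidden_size, stored in fp16 (2 bytes per element).
num_layers = 32
hidden_size = 4096              # 32 heads * head_dim 128
bytes_per_elem = 2              # fp16
kv_per_token = 2 * num_layers * hidden_size * bytes_per_elem

print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")      # ~0.5 MiB
for tokens in (700, 2048, 4096):
    print(f"{tokens:>5} tokens -> {tokens * kv_per_token / 2**30:.2f} GiB")
# Even a full 4096-token context is only ~2 GiB, so usage climbing to 39 GiB
# across requests points at retained state rather than one prompt's KV cache.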
These are consecutive requests, right?
Correct, sending a curl request after the first one completes.
Yes, this is currently a bug that has also been reported elsewhere; I'm taking a look at the moment.
I am having the same issue with an A100 80GB:
openllm start llama --model-id meta-llama/Llama-2-13b-chat-hf --quantize int4
The error does not occur with the standard config of:
OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS=128
Thanks a lot for your help!
By the way, you can change max_new_tokens per request. I'm going to rework the environment variable handling soon.
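For example, a per-request cap can be sent along with the prompt; the llm_config field name below is an assumption based on the /v1/generate schema OpenLLM used around the 0.2-0.3 releases, so check openllm --help or the API docs for your version:

# Hedged sketch: cap max_new_tokens for one request instead of relying on the
# OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS environment variable.
# "llm_config" is assumed from the /v1/generate schema of that era.
import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={
        "prompt": "hello there, my name is ",
        "llm_config": {"max_new_tokens": 128},  # bound generation length and KV-cache growth
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json())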
Are there any updates on this issue @aarnphm? Thanks! Any workaround available?
You can try --quantize gptq for now. I just have a lot of other priorities at the moment.
Version 0.3.5 fixed that issue on my side. Thanks a lot for your support!
@gbmarc1 can you try again with 0.3.5?
I have the same issue with an A100 40GB on Google Cloud.
@marijnbent what is your batch size and requests configuration?
Hey, can you try again? I think this should be fixed by now.
I have the same issue on an A100 80GB. If I use --backend vllm, the VRAM usage first goes up to 15 GB and then, after a few seconds, to 70 GB.
I am using openllm start NousResearch/llama-2-7b-hf --backend vllm, and I see the same behavior with other models.
With the vLLM backend the VRAM usage goes up to 70 GB, but with the PyTorch backend it stays at the normal level, around 14 GB to 24 GB depending on the model.
Maybe there is an issue with the NVIDIA cache in vLLM? (See the sketch after the version details below.)
If someone fixes the problem, please let me know. Specs:
Nvidia Driver Version: 535.54.03 CUDA Version: 12.2
openllm, 0.4.10 (compiled: False)
Python (CPython) 3.10.12
pip list:
openllm 0.4.10
openllm-client 0.4.10
openllm-core 0.4.10
vllm 0.2.1.post1
torch 2.0.1
sentence-transformers 2.2.2
sentencepiece 0.1.99
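The jump to ~70 GB with the vLLM backend is most likely not a leak: vLLM pre-allocates a large fraction of GPU memory up front for its paged KV-cache blocks, controlled by gpu_memory_utilization (default 0.9), which matches roughly 70 GB on an 80 GB A100. A minimal sketch against vLLM directly is below; whether and how openllm passes this knob through is version-dependent, so treat that part as an assumption:

# Minimal vLLM sketch showing the pre-allocation knob. With the default
# gpu_memory_utilization=0.9, vLLM reserves ~90% of the card for KV-cache
# blocks at startup, regardless of how small the model weights are.
from vllm import LLM, SamplingParams

llm = LLM(
    model="NousResearch/llama-2-7b-hf",
    gpu_memory_utilization=0.3,  # reserve only ~30% of VRAM instead of the 0.9 default
)
outputs = llm.generate(["hello there, my name is "], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)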
Maybe try upgrading vllm.
I am having the same error with llmware/bling-sheared-llama-1.3b-0.1 on a 24 GB RTX 3090. When I load the model, it takes almost all of the memory even though it should take far less than 24 GB.
I am using bentoml to serve the model, but parallel requests raise the error immediately.
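One way to narrow this down is to compare PyTorch's allocator counters with what nvidia-smi shows, since nvidia-smi reports roughly the allocator's reserved pool plus the CUDA context, not just live tensors. A small sketch you could run inside the serving process (the helper name is illustrative):

# Distinguish "allocated" (live tensors) from "reserved" (held by PyTorch's
# caching allocator). A nearly full card in nvidia-smi does not by itself
# prove that the model or the requests are leaking memory.
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    alloc = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")

log_cuda_memory("after model load")
# ... send one or more requests, then log again ...
log_cuda_memory("after requests")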
Any updates on how to fix this? I am facing the same issue when running mistral-7b quantized to 4-bit with --backend vllm.
@pavel1860 @aarnphm @1E04 @marijnbent @Bernsai @dudeperf3ct