
bug: 40GB is not enough for llama-2-7b

Open gbmarc1 opened this issue 1 year ago • 23 comments

Describe the bug

Starting a server with an int4-quantized model on a 40GB GPU throws OOM errors. If I am not mistaken, int4 quantization should result in a ~3.5 GB model, and even the original ~28 GB model should easily fit in memory.
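
A rough back-of-the-envelope estimate of the weight footprint in Python (a minimal sketch; weights only, ignoring KV cache, activations, and any temporarily dequantized buffers):

params = 7_000_000_000  # approximate Llama-2-7B parameter count
print(f"fp32 weights: {params * 4 / 1e9:.1f} GB")    # ~28 GB
print(f"fp16 weights: {params * 2 / 1e9:.1f} GB")    # ~14 GB
print(f"int4 weights: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB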

To reproduce

  1. openllm start llama --model-id meta-llama/Llama-2-7b-hf --quantize int4

  2. curl -XPOST localhost:3000/v1/generate -H 'content-type: application/json' -d '{"prompt": "hello there, my name is "}'
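
For reference, the same request issued from Python instead of curl (a minimal sketch using the requests library; assumes the server from step 1 is listening on localhost:3000):

import requests

resp = requests.post(
    "http://localhost:3000/v1/generate",
    json={"prompt": "hello there, my name is "},  # same payload as the curl example above
    timeout=600,
)
print(resp.status_code, resp.text)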

Logs

2023-08-28T22:20:12+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 69, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 291, in _request_handler
    payload = await infer(params)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 182, in _func
    raise r
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 377, in outbound_call
    outputs = await self.callback(
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 271, in infer_single
    ret = await runner_method.async_run(*params.args, **params.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 62, in async_run_method
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runnable.py", line 143, in method
    return self.func(obj, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1168, in generate
    return self.generate(prompt, **attrs)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 986, in generate
    for it in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1024, in generate_iterator
    out = self.model(torch.as_tensor([input_ids], device=self.device), use_cache=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 85.12 MiB is free. Process 17224 has 39.31 GiB memory in use. Of the allocated memory 36.93 GiB is allocated by PyTorch, and 1.79 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-28T22:20:12+0000 [ERROR] [api_server:llm-llama-service:6] Exception on /v1/generate [POST] (trace=64f1f2855c26af6f1d6212b60faeab5d,span=2f1792bb2f5b298e,sampled=1,service.name=llm-llama-service)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
    output = await api.func(*args)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_service.py", line 36, in generate_v1
    responses = await runner.generate.async_run(qa_inputs.prompt, **{'adapter_name': qa_inputs.adapter_name, **config})
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
    return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 242, in async_run_method
    raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-llama-runner: [500] Internal Server Error


Environment

"openllm[llama,gptq]==0.2.27"
protobuf==3.20.3
bitsandbytes==0.41.0

System information (Optional)

_No response_

gbmarc1 · Aug 28 '23 22:08

How many GPUs do you have?

aarnphm · Aug 29 '23 11:08

Can you also send the whole stack trace from the server?

aarnphm · Aug 29 '23 11:08

How many GPUs do you have?

Only 1 A100 40GB

gbmarc1 · Aug 29 '23 20:08

Can you also send the whole stack trace from the server?

Not complete, but here's a big chunk:

2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 75.12 MiB is free. Process 17107 has 39.32 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 75.12 MiB is free. Process 17107 has 39.32 GiB memory in use. Of the allocated memory 37.65 GiB is allocated by PyTorch, and 1.09 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 85.12 MiB is free. Process 17107 has 39.31 GiB memory in use. Of the allocated memory 37.58 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:11] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 49.12 MiB is free. Process 17107 has 39.34 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:11] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:16+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 49.12 MiB is free. Process 17107 has 39.34 GiB memory in use. Of the allocated memory 37.62 GiB is allocated by PyTorch, and 1.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:16+0000 [ERROR] [api_server:llm-llama-service:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:17+0000 [ERROR] [runner:llm-llama-runner:1] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 252, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 341, in streamer
    async for p in payload:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/runner_app.py", line 209, in inner
    async for data in ret:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 179, in iterate_in_threadpool
    yield await anyio.to_thread.run_sync(_next, iterator)
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/utils.py", line 169, in _next
    return next(iterator)
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1181, in generate_iterator
    for outputs in self.generate_iterator(prompt, **attrs):
  File "/usr/local/lib/python3.10/dist-packages/openllm/_llm.py", line 1026, in generate_iterator
    out = self.model(input_ids=torch.as_tensor([[token]], device=self.device), use_cache=True, past_key_values=past_key_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 426, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 220, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 248, in forward
    out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 579, in matmul_4bit
    return MatMul4Bit.apply(A, B, out, bias, quant_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 516, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, state).to(A.dtype).t(), bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 27.12 MiB is free. Process 17107 has 39.36 GiB memory in use. Of the allocated memory 37.81 GiB is allocated by PyTorch, and 1003.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-08-29T21:07:17+0000 [ERROR] [api_server:llm-llama-service:6] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/traffic.py", line 26, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/instruments.py", line 135, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 580, in __call__
    await self.app(scope, otel_receive, otel_send)
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
    await self.app(scope, receive, wrapped_send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 727, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 285, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 57, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 46, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 270, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 597, in __aexit__
    raise exceptions[0]
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 273, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 262, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 330, in async_stream_method
    async for b, end_of_http_chunk in resp.content.iter_chunks():
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 51, in __anext__
    rv = await self._stream.readchunk()
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 433, in readchunk
    await self._wait("readchunk")
  File "/usr/local/lib/python3.10/dist-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:9] 172.17.0.1:45274 (scheme=http,method=GET,path=/api,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 1.177ms (trace=817f402b2514f1b74a15c125ce18be1c,span=8a3ddba07f51ac85,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45282 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.834ms (trace=96a7bb2df2d520cfb0cff1945333902a,span=7a66b344beb302ba,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45288 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.676ms (trace=c88b71f6facc1d03460e5c6ae338d21f,span=7733fb897e86375c,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45294 (scheme=http,method=GET,path=/api/kernels,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.635ms (trace=61245fa755cd129a1d50a9002175727b,span=a3a950a6859b0128,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45310 (scheme=http,method=GET,path=/api/sessions,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.632ms (trace=ca9ba2c6172f14c94b660da53acc0316,span=b1d6276b9658e3b2,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:20+0000 [INFO] [api_server:llm-llama-service:6] 172.17.0.1:45324 (scheme=http,method=GET,path=/api/terminals,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.638ms (trace=0a39ef6cb78ebe43f9b89f6566c77d12,span=3bfd08a359b3b3de,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51128 (scheme=http,method=GET,path=/api,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.921ms (trace=7ec21411d314f77fab046d00a322f7d9,span=2c60c089b3634328,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51134 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.750ms (trace=4e258c0ff298f0c24124d2852e206d8b,span=387300af92476321,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51150 (scheme=http,method=GET,path=/api/status,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.680ms (trace=e0be02595163242f028cfe8cc0d1a4a8,span=6b3435e2fd726248,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51152 (scheme=http,method=GET,path=/api/kernels,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.699ms (trace=49aea1821f92b418bb851228acee6d54,span=781f810f146045c9,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51154 (scheme=http,method=GET,path=/api/sessions,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.641ms (trace=946de737203a55f5fb6c2ad0593f85f1,span=eca34071b37ebbd5,sampled=1,service.name=llm-llama-service)
2023-08-29T21:07:50+0000 [INFO] [api_server:llm-llama-service:12] 172.17.0.1:51158 (scheme=http,method=GET,path=/api/terminals,type=,length=) (status=404,type=text/plain; charset=utf-8,length=9) 0.636ms (trace=6b05df432d0aa3d0644e27438fc9caf0,span=f035c19152fcf010,sampled=1,service.name=llm-llama-service)

gbmarc1 avatar Aug 29 '23 21:08 gbmarc1
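
As a side note, the allocator hint at the end of the traceback can be tried by setting PYTORCH_CUDA_ALLOC_CONF before starting the server. This is only a sketch of the standard PyTorch allocator option, not a confirmed fix for this bug; the value 128 and the model/quantization flags are arbitrary examples:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf --quantize int8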

So did the model start up correctly? I was able to run llama-2-7b with 1 T4.

aarnphm avatar Aug 29 '23 21:08 aarnphm

Yes, it did start up correctly. The model seems to be loaded, but it fails to serve when it receives an inference request.

gbmarc1 avatar Aug 29 '23 21:08 gbmarc1

I am facing the same issue on an NVIDIA L4 (24 GB VRAM) using llama-2-7b-chat (quantized versions).

openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf --device 0 --quantize int8 --verbose

I tried two different GPUs and two quantized versions (int4 and int8) of llama-2-7b, but hit the same OOM issue.

While testing, I started with a small prompt and it worked, with memory consumption of about 10-13 GB. The curl request using the same prompt (Content-Length: 997) also worked. But when I switched to a larger prompt (Content-Length: 2752), it failed with an OOM error.

dudeperf3ct avatar Aug 30 '23 06:08 dudeperf3ct
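
For what it's worth, here is a rough back-of-the-envelope sketch in Python of how the fp16 KV cache grows with sequence length, assuming standard Llama-2-7B dimensions (32 layers, 32 attention heads, head dimension 128). Even a few thousand tokens should only add on the order of a couple of GiB, which is far smaller than the ~37 GiB the traceback reports PyTorch holding, so prompt length alone does not look like the whole story:

# Rough estimate of fp16 KV-cache size as a function of sequence length.
# Assumed Llama-2-7B dimensions: 32 layers, 32 heads, head_dim 128, 2 bytes/element.
# This ignores activations and any memory that is not freed between requests.
def kv_cache_gib(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem  # key + value
    return seq_len * per_token / 1024**3

for n in (256, 1024, 4096):
    print(f"{n:>5} tokens -> {kv_cache_gib(n):.3f} GiB")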

These are consecutive requests, right?

aarnphm avatar Aug 30 '23 08:08 aarnphm

Correct, sending a curl request after the first one completes.

dudeperf3ct avatar Aug 30 '23 08:08 dudeperf3ct

Yes, this is currently a bug that has also been reported elsewhere; I'm taking a look at the moment.

aarnphm avatar Aug 30 '23 08:08 aarnphm

I am having the same issue with an A100 80GB:

openllm start llama --model-id meta-llama/Llama-2-13b-chat-hf --quantize int4

The error does not occur with the standard config of:

OPENLLM_LLAMA_GENERATION_MAX_NEW_TOKENS=128

Thanks a lot for your help!

Bernsai avatar Aug 31 '23 10:08 Bernsai

By the way, you can change max_new_tokens per request. I'm going to change how the environment variables work soon.

aarnphm avatar Aug 31 '23 12:08 aarnphm
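
For example, a per-request override might look like the following (assuming this OpenLLM version's /v1/generate schema accepts an llm_config object; check the schema of your installed version if it differs):

curl -X POST localhost:3000/v1/generate -H 'content-type: application/json' -d '{"prompt": "What is quantization?", "llm_config": {"max_new_tokens": 64}}'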

Are there any updates on this issue, @aarnphm? Thanks! Is any workaround available?

Bernsai avatar Sep 06 '23 11:09 Bernsai

You can try --quantize gptq for now

I just have a lot of priorities at the moment.

aarnphm avatar Sep 06 '23 13:09 aarnphm
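
A hypothetical invocation for reference (the checkpoint name is an assumption; GPTQ normally needs a pre-quantized model rather than quantizing the original weights on the fly, and exact flag support depends on the installed OpenLLM version):

openllm start llama --model-id TheBloke/Llama-2-7B-Chat-GPTQ --quantize gptq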

Version 0.3.5 fixed that issue on my side. Thanks a lot for your support!

Bernsai avatar Sep 18 '23 08:09 Bernsai

@gbmarc1 can you try again with 0.3.5?

aarnphm avatar Sep 18 '23 16:09 aarnphm

I have the same issue with an A100 40GB on Google Cloud.

marijnbent avatar Oct 11 '23 16:10 marijnbent

@marijnbent what is your batch size and request configuration?

aarnphm avatar Oct 12 '23 09:10 aarnphm

Hey, can you try again? I think this should be fixed by now.

aarnphm avatar Nov 08 '23 13:11 aarnphm

I have the same issue on an A100 80GB. If I use --backend vllm, VRAM usage first goes up to about 15 GB and then, after a few seconds, to 70 GB. I am using openllm start NousResearch/llama-2-7b-hf --backend vllm, and I see the same behavior with other models. With the vLLM backend VRAM usage reaches 70 GB, while with the PyTorch backend usage is normal, around 14 GB to 24 GB depending on the model. Maybe there is an issue with how vLLM caches memory on the GPU?

If someone fixes the problem, please let me know. Specs:

Nvidia Driver Version: 535.54.03    CUDA Version: 12.2
openllm, 0.4.10 (compiled: False)
Python (CPython) 3.10.12

pip list:
openllm                                      0.4.10
openllm-client                               0.4.10
openllm-core                                 0.4.10
vllm                                         0.2.1.post1
torch                                        2.0.1
sentence-transformers                        2.2.2
sentencepiece                                0.1.99

1E04 avatar Dec 01 '23 10:12 1E04
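
The jump to ~70 GB with --backend vllm is consistent with vLLM pre-allocating most of the GPU for its paged KV cache (its gpu_memory_utilization option defaults to 0.9) rather than with a leak. Below is a minimal sketch of the underlying vLLM API that exposes this knob; whether OpenLLM 0.4.10 passes this option through its own CLI is not confirmed here:

from vllm import LLM, SamplingParams

# Reserve only ~50% of GPU memory for weights and KV-cache blocks
# instead of vLLM's default 0.9; model id matches the one used above.
llm = LLM(model="NousResearch/llama-2-7b-hf", gpu_memory_utilization=0.5)
outputs = llm.generate(["What is quantization?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)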

Maybe try upgrading vllm.

aarnphm avatar Dec 02 '23 11:12 aarnphm

I'm having the same error with llmware/bling-sheared-llama-1.3b-0.1 on a 24 GB RTX 3090. When I load the model it takes almost all the memory, even though it should take far less than 24 GB. I am using BentoML to serve the model, but parallel requests raise the error immediately.

pavel1860 avatar Dec 04 '23 11:12 pavel1860

Any updates on how to fix this? I'm facing the same issue when running mistral-7b quantized to 4-bit with --backend vllm.

@pavel1860 @aarnphm @1E04 @marijnbent @Bernsai @dudeperf3ct

matankley avatar Dec 07 '23 10:12 matankley