alpaca-lora
Sharing updated 30B Alpaca-LoRA chatbot playground
I shared one at #87 , and it is still running.
Today, I want to share a slightly modified version of it running on a separate A6000 instance. Here is what has changed:
- streaming mode
- context awareness
I have noticed that streaming mode works well when I limit it to two connections at a time, so that is what I did. I also found a bug in the summarize button and the context field, and fixed it. Check out the repo as well: https://github.com/deep-diver/Alpaca-LoRA-Serve
Here is the link to the demo that is currently up and running: https://notebookse.jarvislabs.ai/BuOu_VbEuUHb09VEVHhfnFq4-PMhBRVCcfHBRCOrq7c4O9GI4dIGoidvNf76UsRL

I would appreciate any feedback and suggestions, particularly on how to speed things up. Also, which do you prefer: the streaming version or the dynamic batching version?
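
If you are curious roughly how the streaming and context-awareness pieces fit together, below is a heavily simplified sketch using transformers' TextIteratorStreamer and a Gradio generator function, with the connection limit set to two. It is illustrative only, not the code from the repo; the base model ID and generation settings are just placeholders.

```python
# Heavily simplified sketch -- NOT the actual Alpaca-LoRA-Serve code.
# Assumes transformers >= 4.28 (for TextIteratorStreamer) and Gradio 3.x.
from threading import Thread

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, device_map="auto"
)

def build_prompt(history, message):
    # "Context awareness": fold previous turns into an Alpaca-style prompt.
    prompt = ""
    for user_turn, bot_turn in history:
        prompt += f"### Instruction:{user_turn}\n### Response:{bot_turn}\n"
    return prompt + f"### Instruction:{message}\n### Response:"

def chat_stream(message, history):
    inputs = tokenizer(build_prompt(history, message), return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a thread and read tokens as they arrive.
    Thread(target=model.generate,
           kwargs=dict(inputs, streamer=streamer, max_new_tokens=256)).start()
    partial = ""
    for new_text in streamer:
        partial += new_text
        yield partial  # Gradio re-renders the output box on every yield

with gr.Blocks() as demo:
    instruction = gr.Textbox(label="Instruction")
    response = gr.Textbox(label="Response")
    history = gr.State([])  # list of (instruction, response) pairs; updating it is omitted here
    instruction.submit(chat_stream, [instruction, history], response)

demo.queue(concurrency_count=2)  # limit to two concurrent generations
demo.launch(server_port=5000)
```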
Hello @deep-diver, I wanted to express my appreciation for the remarkable Alpaca-LoRA-Serve project. Do you think it is possible to tweak gen.py to run inference on a V100 with chansung/alpaca-lora-30b without running out of memory?
$ nvidia-smi
Wed Mar 22 16:29:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32'
$ BASE_URL=decapoda-research/llama-30b-hf
$ FINETUNED_CKPT_URL=tloen/chansung/alpaca-lora-30b
$ python app.py --base_url $BASE_URL --ft_ckpt_url $FINETUNED_CKPT_URL --port 5000 --batch_size 1
...
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 61/61 [01:02<00:00, 1.03s/it]
Running on local URL: http://0.0.0.0:5000
To create a public link, set `share=True` in `launch()`.
### Instruction:hello world
### Response:
Traceback (most recent call last):
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 119, in _infer
return model_fn(**kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 390, in forward
output = torch.nn.functional.linear(A_wo_outliers, state.CB.to(A.dtype))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 31.75 GiB total capacity; 31.33 GiB already allocated; 54.75 MiB free; 31.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/blocks.py", line 1032, in process_api
result = await self.call_function(
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/blocks.py", line 858, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/utils.py", line 448, in async_iteration
return next(iterator)
File "/home/foo123/Alpaca-LoRA-Serve/app.py", line 31, in chat_stream
for tokens in bot_response:
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 87, in __call__
for (
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 238, in generate
outputs = self._infer(
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 326, in iter
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7f9564327af0 state=finished raised OutOfMemoryError>]
Thanks for the kind words :)
Unfortunately, a V100's 32 GB of memory is insufficient to run the 30B Alpaca-LoRA model in 8-bit.
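As a rough back-of-the-envelope estimate (approximate numbers, not a precise accounting): the 30B weights alone take about 30 GiB in int8, and the LoRA adapter, activations, and KV cache push usage past the V100's 32 GiB, which matches the "31.33 GiB already allocated" in your traceback.

```python
# Rough, approximate estimate -- not a precise memory accounting.
n_params = 32.5e9                    # the LLaMA "30B" model has ~32.5B parameters
weights_gib = n_params * 1 / 2**30   # int8 quantization -> ~1 byte per parameter
print(f"8-bit weights alone: {weights_gib:.1f} GiB")  # ~30.3 GiB
# Add the LoRA adapter, activations, and the KV cache that grows during
# generation, and total usage exceeds the V100's 32 GiB.
```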
Wow, it's much faster now. Is this just because of the streaming mode, or did you change something else?
Thanks for sharing! Does this use the chansung/alpaca-lora-30b LoRA? Was it trained on the same Alpaca dataset provided here?
It uses the one from my account, which was trained on the cleaned dataset provided by this repo.
Nice. It seems really good, the best Alpaca I have tried so far. It looks like the 30B model is the way to go. Not easy to run on consumer hardware, though.
@DanielWe2 thanks! Glad you liked it :)