alpaca-lora
Sharing updated 30B Alpaca-LoRA chatbot playground
I shared one at #87 , and it is still running.
Today, I want to share a slightly modified version of it running on a separate A6000 instance. Here is what has changed:
- streaming mode
- context awareness
I have noticed that streaming mode works well when I limit it to two connections at a time, so that is what I did. I also found a bug in the summarize button and the context field, and fixed it. Check out the repo as well: https://github.com/deep-diver/Alpaca-LoRA-Serve
Here is the link to the demo that is currently up and running: https://notebookse.jarvislabs.ai/BuOu_VbEuUHb09VEVHhfnFq4-PMhBRVCcfHBRCOrq7c4O9GI4dIGoidvNf76UsRL

I would appreciate any feedback and suggestions, particularly on how to speed things up. Also, which do you prefer: the streaming version or the dynamic batching version?
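
If you are curious roughly how the streaming and context-awareness pieces fit together, below is a heavily simplified sketch using transformers' TextIteratorStreamer and a Gradio generator function, with the connection limit set to two. It is illustrative only, not the code from the repo; the base model ID and generation settings are just placeholders.

```python
# Heavily simplified sketch -- NOT the actual Alpaca-LoRA-Serve code.
# Assumes transformers >= 4.28 (for TextIteratorStreamer) and Gradio 3.x.
from threading import Thread

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, device_map="auto"
)

def build_prompt(history, message):
    # "Context awareness": fold previous turns into an Alpaca-style prompt.
    prompt = ""
    for user_turn, bot_turn in history:
        prompt += f"### Instruction:{user_turn}\n### Response:{bot_turn}\n"
    return prompt + f"### Instruction:{message}\n### Response:"

def chat_stream(message, history):
    inputs = tokenizer(build_prompt(history, message), return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks, so run it in a thread and read tokens as they arrive.
    Thread(target=model.generate,
           kwargs=dict(inputs, streamer=streamer, max_new_tokens=256)).start()
    partial = ""
    for new_text in streamer:
        partial += new_text
        yield partial  # Gradio re-renders the output box on every yield

with gr.Blocks() as demo:
    instruction = gr.Textbox(label="Instruction")
    response = gr.Textbox(label="Response")
    history = gr.State([])  # list of (instruction, response) pairs; updating it is omitted here
    instruction.submit(chat_stream, [instruction, history], response)

demo.queue(concurrency_count=2)  # limit to two concurrent generations
demo.launch(server_port=5000)
```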
Hello @deep-diver, I wanted to express my appreciation for the remarkable Alpaca-LoRA-Serve project. Do you think it is possible to tweak gen.py to run inference on a V100 with chansung/alpaca-lora-30b without running out of memory?
$ nvidia-smi
Wed Mar 22 16:29:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:3E:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32'
$ BASE_URL=decapoda-research/llama-30b-hf
$ FINETUNED_CKPT_URL=tloen/chansung/alpaca-lora-30b
$ python app.py --base_url $BASE_URL --ft_ckpt_url $FINETUNED_CKPT_URL --port 5000 --batch_size 1
...
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████| 61/61 [01:02<00:00, 1.03s/it]
Running on local URL: http://0.0.0.0:5000
To create a public link, set `share=True` in `launch()`.
### Instruction:hello world
### Response:
Traceback (most recent call last):
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 119, in _infer
return model_fn(**kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/foo123/.local/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/foo123/.local/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/foo123/.local/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 390, in forward
output = torch.nn.functional.linear(A_wo_outliers, state.CB.to(A.dtype))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB (GPU 0; 31.75 GiB total capacity; 31.33 GiB already allocated; 54.75 MiB free; 31.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/routes.py", line 384, in run_predict
output = await app.get_blocks().process_api(
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/blocks.py", line 1032, in process_api
result = await self.call_function(
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/blocks.py", line 858, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/foo123/.local/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/foo123/.local/lib/python3.9/site-packages/gradio/utils.py", line 448, in async_iteration
return next(iterator)
File "/home/foo123/Alpaca-LoRA-Serve/app.py", line 31, in chat_stream
for tokens in bot_response:
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 87, in __call__
for (
File "/home/foo123/Alpaca-LoRA-Serve/gen.py", line 238, in generate
outputs = self._infer(
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
File "/home/foo123/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 326, in iter
raise retry_exc from fut.exception()
tenacity.RetryError: RetryError[<Future at 0x7f9564327af0 state=finished raised OutOfMemoryError>]
Thanks for the kind words :)
Unfortunately, a V100's 32 GB of memory is insufficient to run the 30B Alpaca-LoRA model in 8-bit.
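As a rough back-of-the-envelope estimate (approximate numbers, not a precise accounting): the 30B weights alone take about 30 GiB in int8, and the LoRA adapter, activations, and KV cache push usage past the V100's 32 GiB, which matches the "31.33 GiB already allocated" in your traceback.

```python
# Rough, approximate estimate -- not a precise memory accounting.
n_params = 32.5e9                    # the LLaMA "30B" model has ~32.5B parameters
weights_gib = n_params * 1 / 2**30   # int8 quantization -> ~1 byte per parameter
print(f"8-bit weights alone: {weights_gib:.1f} GiB")  # ~30.3 GiB
# Add the LoRA adapter, activations, and the KV cache that grows during
# generation, and total usage exceeds the V100's 32 GiB.
```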
Wow, it's much faster now. Is this just because of the streaming mode, or did you change something else?
Thanks for sharing! Does this use the chansung/alpaca-lora-30b LoRA? Was it trained on the same Alpaca dataset provided here?
It uses the one from my account, which was trained on the cleaned dataset provided by this repo.
Nice. It seems really good, the best Alpaca I have tried so far. It looks like the 30B model is the way to go. Not easy to run on consumer hardware, though.
@DanielWe2 thanks! Glad you liked it :)