Chinese-LLaMA-Alpaca
Add new Gradio web demo for Chinese-LLaMA-Alpaca
I wrote a simple Gradio web demo that supports multi-round conversation and multi-GPU inference. Here is the shell command to start the web demo:
python gradio_demo.py --base_model /home/sunyuhan/syh/sunyuhan/zju/llama-7b-hf/ --lora_model /home/sunyuhan/syh/sunyuhan/zju/chinese-alpaca-lora-7b --with_prompt --gpus 4,5,6,7
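For readers who want a feel for the structure before opening the PR, here is a minimal sketch of such a demo. This is not the actual gradio_demo.py: the prompt template, generation settings, and tokenizer handling below are assumptions.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--base_model", required=True)
parser.add_argument("--lora_model", default=None)
parser.add_argument("--gpus", default="0")
args = parser.parse_args()
# Restrict visible GPUs before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = args.gpus

import torch
import gradio as gr
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(args.lora_model or args.base_model)
model = LlamaForCausalLM.from_pretrained(
    args.base_model, torch_dtype=torch.float16, device_map="auto"
)
# The Chinese LoRAs ship an extended vocabulary; grow the embeddings to match.
if model.get_input_embeddings().weight.size(0) != len(tokenizer):
    model.resize_token_embeddings(len(tokenizer))
if args.lora_model is not None:
    model = PeftModel.from_pretrained(model, args.lora_model)
model.eval()


def predict(message, history):
    # Replay earlier turns so the model sees the whole conversation.
    prompt = ""
    for user_msg, bot_msg in history:
        prompt += f"### Instruction:\n{user_msg}\n\n### Response:\n{bot_msg}\n\n"
    prompt += f"### Instruction:\n{message}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=400)
    reply = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    history.append((message, reply))
    return history, ""


with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(placeholder="Type a question and press Enter")
    # The Chatbot component doubles as the conversation state.
    msg.submit(predict, [msg, chatbot], [chatbot, msg])

demo.launch()
```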
Looks awesome! Thank you for your contribution.
But I have a question about loading the Alpaca Plus models. Since Plus requires two LoRA weights, does the demo support loading multiple LoRAs? Or, alternatively, does it support loading only base_model without lora_model (since users can merge the LoRAs into the base model to get a single merged Plus model weight file)?
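For context, the two Plus LoRAs can be folded into the base weights offline with peft, so the demo would then only ever see a single model directory. A rough sketch, assuming a peft version that provides merge_and_unload; the directory names are illustrative, and the repo's own merge script remains the authoritative procedure:

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained("llama-7b-hf", torch_dtype=torch.float16)

# Apply and fold in each LoRA in turn: first the LLaMA-Plus weights,
# then the Alpaca-Plus weights on top of them.
for lora_dir in ["chinese-llama-plus-lora-7b", "chinese-alpaca-plus-lora-7b"]:
    # Each LoRA ships its own (extended) tokenizer; resize embeddings to match.
    tokenizer = LlamaTokenizer.from_pretrained(lora_dir)
    base.resize_token_embeddings(len(tokenizer))
    base = PeftModel.from_pretrained(base, lora_dir)
    base = base.merge_and_unload()  # bake the adapter into the base weights

base.save_pretrained("merged-alpaca-plus-7b")
tokenizer.save_pretrained("merged-alpaca-plus-7b")
```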
support base_model-only mode
Thanks. We will run some tests before confirming the merge.
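The base_model-only mode itself is a small change: treat --lora_model as optional and skip the peft wrapping when it is absent. A sketch of the relevant branch (base_model here stands for the already-loaded HF model; the surrounding variable names are assumptions about the demo script's internals):

```python
from peft import PeftModel

# --lora_model is now optional; with a merged Plus model, pass only --base_model.
if args.lora_model is not None:
    print("loading peft model")
    model = PeftModel.from_pretrained(base_model, args.lora_model)
else:
    # Base-model-only mode: the checkpoint already contains the merged weights.
    model = base_model
```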
Please let me know as soon as possible if you have any advice or find any bugs. I would be grateful for your feedback and am eager to improve the demo based on your suggestions. Thank you very much.
In this notebook:
https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/notebooks/convert_and_quantize_chinese_llama.ipynb?usp=sharing#scrollTo=EjkXqaqbmrVZ
the following command runs successfully:
!cd Chinese-LLaMA-Alpaca/ && python scripts/my_gradio_demo.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_model 'ziqingyang/chinese-alpaca-lora-7b'
But in this notebook:
https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/notebooks/finetune_chinese_alpaca_lora.ipynb
the same demo launched with
!cd Chinese-LLaMA-Alpaca/ && python scripts/my_gradio_demo.py \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_model '/content/output_model/peft_model'
starts up, but asking a question raises an error:
```
2023-05-15 08:58:49.995893: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading checkpoint shards: 100% 33/33 [00:18<00:00, 1.75it/s]
Vocab of the base model: 32000
Vocab of the tokenizer: 49954
Resize model embeddings to fit tokenizer
loading peft model
Running on local URL: http://127.0.0.1:7860/
Running on public URL: https://da6b02733e495413ee.gradio.live/

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1320, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1048, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/content/Chinese-LLaMA-Alpaca/scripts/my_gradio_demo.py", line 132, in predict
    generation_output = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1524, in generate
    return self.beam_search(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora.py", line 358, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float
Keyboard interruption in main thread... closing server.
```
Solved: change load_type = torch.float16 to load_type = torch.float32.
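For anyone hitting the same RuntimeError: the base model is loaded in half precision while the LoRA weights produced by the fine-tuning notebook are float32, so the addition inside peft's lora.py forward mixes dtypes. Loading everything in float32 reconciles them, at some cost in memory and speed. Illustratively, in my_gradio_demo.py:

```python
import torch

# Original setting: half precision. The float32 LoRA weights from the
# fine-tuning notebook then clash with the fp16 activations inside
# peft/tuners/lora.py ("expected scalar type Half but found Float").
# load_type = torch.float16

# Fix: load both the base model and the LoRA in full precision.
load_type = torch.float32
```

Casting the freshly trained LoRA weights to float16 before loading would be the other way to reconcile the dtypes.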