TinyLlama
Working Chat Demo
@jzhang38
Saw the Chat Demo wasn't working, so I made a chat WebUI for the model, like we discussed on the old pull. The UI is similar to OpenChat. I'm still working on the side features, but the main thing works: there's a chat between the user and the bot.
Colab - https://colab.research.google.com/drive/1OaWYiHBt-nkSNCik6H0lhAWcpLCYvauq?usp=sharing
~~It's wonky, since the bot isn't trained on any multi-turn chat data, and openassistant-guanaco is really small, without nearly enough examples.~~ That was actually a bug on my end; it's fixed and the chat is now really good!
This currently runs on GPU with vLLM, but it could also run on a CPU with 2-4 GB of RAM using llama-cpp-python and a GGUF build of the model.
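For reference, a minimal sketch of that CPU path could look like the following. It assumes a GGUF conversion of the chat model has already been downloaded; the filename, thread count, and sampling values are placeholders rather than anything from this notebook:

```python
# Hypothetical CPU-only setup with llama-cpp-python and a GGUF file.
# The model path and settings below are assumptions, not part of the demo.
from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat.Q4_K_M.gguf",  # placeholder GGUF filename
    n_ctx=2048,    # TinyLlama's context length
    n_threads=4,   # tune for the available CPU cores
)

# Same prompt style the WebUI sends (see the request log further down).
prompt = "### Human: tell me a joke### Assistant:"
out = llm(prompt, max_tokens=100, temperature=0.8, top_p=0.95)
print(out["choices"][0]["text"])
```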
Image Example:
Overall UI:
Also, be sure to enter your ngrok auth token in the Colab, or it won't work.
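If it helps, one way to register the token from a Colab cell is via pyngrok; this is just a sketch, the notebook may wire the token in differently, and `YOUR_NGROK_TOKEN` is a placeholder:

```python
# Sketch: register an ngrok auth token before starting the tunnel.
# pyngrok is one option here; the notebook's actual token handling may differ.
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_TOKEN")  # replace with your own token
```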
The Colab notebook is failing for me with a CUDA error when vLLM tries to import cuda_utils:
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
full traceback
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-4bae344bdc7c> in <cell line: 3>()
1 from flask import Flask, render_template, request
2 from flask_ngrok import run_with_ngrok
----> 3 from vllm import LLM, SamplingParams
4
5 app = Flask(__name__, template_folder='/content/tinyLlamaChat')
3 frames
/usr/local/lib/python3.10/dist-packages/vllm/__init__.py in <module>
1 """vLLM: a high-throughput and memory-efficient inference engine for LLMs"""
2
----> 3 from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
4 from vllm.engine.async_llm_engine import AsyncLLMEngine
5 from vllm.engine.llm_engine import LLMEngine
/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py in <module>
4 from typing import Optional, Tuple
5
----> 6 from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
7 SchedulerConfig)
8
/usr/local/lib/python3.10/dist-packages/vllm/config.py in <module>
7 from vllm.logger import init_logger
8 from vllm.transformers_utils.config import get_config
----> 9 from vllm.utils import get_cpu_memory
10
11 logger = init_logger(__name__)
/usr/local/lib/python3.10/dist-packages/vllm/utils.py in <module>
6 import torch
7
----> 8 from vllm import cuda_utils
9
10
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
That CUDA error seems to be an issue with the latest vllm (v0.2.2). Pinning to the version installed in your original Colab with `!pip install vllm==0.2.0` gets past that issue. But the run still fails with an error about incompatible xformers and CUDA versions.
full traceback
WARNING:xformers:WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cu118)
Python 3.10.13 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
model.safetensors: 100%
4.40G/4.40G [00:25<00:00, 171MB/s]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-4-4bae344bdc7c> in <cell line: 8>()
6 run_with_ngrok(app)
7
----> 8 llm = LLM("PY007/TinyLlama-1.1B-Chat-v0.1") # tinyLlama-chat
9
10 @app.route("/")
27 frames
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/dispatch.py in _run_priority_list(name, priority_list, inp)
61 for op, not_supported in zip(priority_list, not_supported_reasons):
62 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)
64
65
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
query : shape=(1, 2048, 32, 64) (torch.float16)
key : shape=(1, 2048, 32, 64) (torch.float16)
value : shape=(1, 2048, 32, 64) (torch.float16)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
p : 0.0
`decoderF` is not supported because:
xFormers wasn't build with CUDA support
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
`flshattF` is not supported because:
xFormers wasn't build with CUDA support
requires device with capability > (8, 0) but your GPU has capability (7, 5) (too old)
operator wasn't built - see `python -m xformers.info` for more info
`tritonflashattF` is not supported because:
xFormers wasn't build with CUDA support
requires device with capability > (8, 0) but your GPU has capability (7, 5) (too old)
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
triton is not available
requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
xFormers wasn't build with CUDA support
operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
xFormers wasn't build with CUDA support
dtype=torch.float16 (supported: {torch.float32})
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
has custom scale
operator wasn't built - see `python -m xformers.info` for more info
unsupported embed per head: 64
Installing an xformers build that matches the CUDA/Torch versions seemed to fix that:
!pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
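For anyone hitting similar mismatches, a quick way to see what the Colab runtime actually has (a small diagnostic sketch, not something from the notebook) is:

```python
# Diagnostic sketch: print the versions and GPU capability that the errors above hinge on.
import torch

print("torch:", torch.__version__)            # e.g. 2.1.0+cu118
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))  # (7, 5) on a T4
```

Running `python -m xformers.info`, as the traceback suggests, also shows which attention operators were actually built.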
And it ran successfully -- at this point I was able to open the UI. But sending a message fails, because the server code is trying to parse temp and topP as ints.
full traceback
INFO 12-01 15:14:03 llm_engine.py:72] Initializing an LLM engine with config: model='PY007/TinyLlama-1.1B-Chat-v0.1', tokenizer='PY007/TinyLlama-1.1B-Chat-v0.1', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 12-01 15:14:03 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 12-01 15:14:23 llm_engine.py:205] # GPU blocks: 32042, # CPU blocks: 11915
* Serving Flask app '__main__'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000/
INFO:werkzeug:Press CTRL+C to quit
* Running on http://bebe-34-133-111-31.ngrok-free.app/
* Traffic stats available on http://127.0.0.1:4040/
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:15:00] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:15:02] "GET /favicon.ico HTTP/1.1" 404 -
ERROR:__main__:Exception on /respond [GET]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2529, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "<ipython-input-1-4bae344bdc7c>", line 17, in respond
temp = int(request.args.get('temp'))
ValueError: invalid literal for int() with base 10: '0.8'
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:16:35] "GET /respond?input=%23%23%23+Human:tell+me+a+joke%23%23%23+Assistant:&temp=0.8&topP=0.95&maxTok=100 HTTP/1.1" 500 -
Changing those ints to floats (and restarting the runtime) was enough for it to work successfully end-to-end 🎉
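For reference, the fix amounts to parsing the sampling parameters as floats before handing them to vLLM. A rough sketch of what the corrected handler could look like, assuming `app` and `llm` are the Flask app and vLLM `LLM` instance created earlier in the notebook (route and parameter names taken from the request log above, defaults are illustrative):

```python
# Sketch of the corrected /respond handler: temp and topP are floats, not ints.
from flask import request
from vllm import SamplingParams

@app.route("/respond")
def respond():
    prompt = request.args.get("input", "")
    temp = float(request.args.get("temp", 0.8))      # was int(...), which broke on '0.8'
    top_p = float(request.args.get("topP", 0.95))
    max_tok = int(request.args.get("maxTok", 100))   # the token count really is an int

    params = SamplingParams(temperature=temp, top_p=top_p, max_tokens=max_tok)
    return llm.generate(prompt, params)[0].outputs[0].text
```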
@gabrielgrant thanks, I've incorporated all the changes, it works again!
Ideally it would be good to figure out why it's blowing up with the latest vllm rather than pinning to an old version. Do you have any insights?
Not really; it's pure Llama, so vLLM should be fine with it. But future versions of this will also need to move to ChatML, which I believe the newer fine-tunes have adopted.
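For context, the demo currently builds prompts in the `### Human: ... ### Assistant:` style visible in the request log above, while ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers. A rough sketch of a ChatML-style prompt builder (the system message here is a placeholder; the exact template should come from the newer fine-tunes' model cards):

```python
# Rough sketch of a ChatML-style prompt for newer chat fine-tunes.
# The system message is a placeholder; check the model card for the exact template.
def build_chatml_prompt(user_message: str,
                        system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("tell me a joke"))
```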
Saw this PR only removed the old HF chat link. Would it make sense to add a link to this Colab to the README?