TinyLlama
Working Chat Demo
@jzhang38
Saw the Chat Demo wasn't working, so I made a chat WebUI for the model, like we discussed on the old pull. The UI is similar to OpenChat. I'm still working on the side features, but the main thing works: there's a chat between the user and the bot.
Colab - https://colab.research.google.com/drive/1OaWYiHBt-nkSNCik6H0lhAWcpLCYvauq?usp=sharing
~~It's wonky, since the bot isn't trained on any multi-turn chat data, and openassistant-guanaco is really small, without nearly enough examples.~~ That was actually a bug on my end; it's fixed and the chat is now really good!
This currently runs on GPU with vLLM, but it could also run on a CPU with 2-4 GB of RAM using llama-cpp-python and a GGUF build of the model.
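For reference, a minimal sketch of that CPU path could look like the following. It assumes a GGUF conversion of the chat model has already been downloaded; the filename, thread count, and sampling values are placeholders rather than anything from this notebook:

```python
# Hypothetical CPU-only setup with llama-cpp-python and a GGUF file.
# The model path and settings below are assumptions, not part of the demo.
from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat.Q4_K_M.gguf",  # placeholder GGUF filename
    n_ctx=2048,    # TinyLlama's context length
    n_threads=4,   # tune for the available CPU cores
)

# Same prompt style the WebUI sends (see the request log further down).
prompt = "### Human: tell me a joke### Assistant:"
out = llm(prompt, max_tokens=100, temperature=0.8, top_p=0.95)
print(out["choices"][0]["text"])
```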
Image Example:
Overall UI:
Also, be sure to enter your ngrok auth token in the Colab, or it won't work.
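If it helps, one way to register the token from a Colab cell is via pyngrok; this is just a sketch, the notebook may wire the token in differently, and `YOUR_NGROK_TOKEN` is a placeholder:

```python
# Sketch: register an ngrok auth token before starting the tunnel.
# pyngrok is one option here; the notebook's actual token handling may differ.
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_TOKEN")  # replace with your own token
```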
The Colab notebook is failing for me with a CUDA error when vLLM tries to import cuda_utils:
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
full traceback
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-4bae344bdc7c> in <cell line: 3>()
1 from flask import Flask, render_template, request
2 from flask_ngrok import run_with_ngrok
----> 3 from vllm import LLM, SamplingParams
4
5 app = Flask(__name__, template_folder='/content/tinyLlamaChat')
3 frames
/usr/local/lib/python3.10/dist-packages/vllm/__init__.py in <module>
1 """vLLM: a high-throughput and memory-efficient inference engine for LLMs"""
2
----> 3 from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
4 from vllm.engine.async_llm_engine import AsyncLLMEngine
5 from vllm.engine.llm_engine import LLMEngine
/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py in <module>
4 from typing import Optional, Tuple
5
----> 6 from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
7 SchedulerConfig)
8
/usr/local/lib/python3.10/dist-packages/vllm/config.py in <module>
7 from vllm.logger import init_logger
8 from vllm.transformers_utils.config import get_config
----> 9 from vllm.utils import get_cpu_memory
10
11 logger = init_logger(__name__)
/usr/local/lib/python3.10/dist-packages/vllm/utils.py in <module>
6 import torch
7
----> 8 from vllm import cuda_utils
9
10
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
That CUDA error seems to be an issue with the latest vllm (v0.2.2). Pinning to the version installed in your original Colab with `!pip install vllm==0.2.0` gets past that issue. But the run still fails with an error about incompatible xformers and CUDA versions.
full traceback
WARNING:xformers:WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cu118)
Python 3.10.13 (you have 3.10.12)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
model.safetensors: 100%
4.40G/4.40G [00:25<00:00, 171MB/s]
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-4-4bae344bdc7c> in <cell line: 8>()
6 run_with_ngrok(app)
7
----> 8 llm = LLM("PY007/TinyLlama-1.1B-Chat-v0.1") # tinyLlama-chat
9
10 @app.route("/")
27 frames
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/dispatch.py in _run_priority_list(name, priority_list, inp)
61 for op, not_supported in zip(priority_list, not_supported_reasons):
62 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)
64
65
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
query : shape=(1, 2048, 32, 64) (torch.float16)
key : shape=(1, 2048, 32, 64) (torch.float16)
value : shape=(1, 2048, 32, 64) (torch.float16)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
p : 0.0
`decoderF` is not supported because:
xFormers wasn't build with CUDA support
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
`flshattF` is not supported because:
xFormers wasn't build with CUDA support
requires device with capability > (8, 0) but your GPU has capability (7, 5) (too old)
operator wasn't built - see `python -m xformers.info` for more info
`tritonflashattF` is not supported because:
xFormers wasn't build with CUDA support
requires device with capability > (8, 0) but your GPU has capability (7, 5) (too old)
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
triton is not available
requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
xFormers wasn't build with CUDA support
operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
xFormers wasn't build with CUDA support
dtype=torch.float16 (supported: {torch.float32})
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
has custom scale
operator wasn't built - see `python -m xformers.info` for more info
unsupported embed per head: 64
Installing an xformers build that matches the CUDA/Torch versions seemed to fix that:
!pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
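For anyone hitting similar mismatches, a quick way to see what the Colab runtime actually has (a small diagnostic sketch, not something from the notebook) is:

```python
# Diagnostic sketch: print the versions and GPU capability that the errors above hinge on.
import torch

print("torch:", torch.__version__)            # e.g. 2.1.0+cu118
print("CUDA (torch build):", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))  # (7, 5) on a T4
```

Running `python -m xformers.info`, as the traceback suggests, also shows which attention operators were actually built.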
And it ran successfully -- at this point I was able to open the UI. But sending a message fails, because the server code is trying to parse temp and topP as ints.
full traceback
INFO 12-01 15:14:03 llm_engine.py:72] Initializing an LLM engine with config: model='PY007/TinyLlama-1.1B-Chat-v0.1', tokenizer='PY007/TinyLlama-1.1B-Chat-v0.1', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 12-01 15:14:03 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 12-01 15:14:23 llm_engine.py:205] # GPU blocks: 32042, # CPU blocks: 11915
* Serving Flask app '__main__'
* Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on http://127.0.0.1:5000/
INFO:werkzeug:Press CTRL+C to quit
* Running on http://bebe-34-133-111-31.ngrok-free.app/
* Traffic stats available on http://127.0.0.1:4040/
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:15:00] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:15:02] "GET /favicon.ico HTTP/1.1" 404 -
ERROR:__main__:Exception on /respond [GET]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 2529, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1825, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1823, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 1799, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "<ipython-input-1-4bae344bdc7c>", line 17, in respond
temp = int(request.args.get('temp'))
ValueError: invalid literal for int() with base 10: '0.8'
INFO:werkzeug:127.0.0.1 - - [01/Dec/2023 15:16:35] "GET /respond?input=%23%23%23+Human:tell+me+a+joke%23%23%23+Assistant:&temp=0.8&topP=0.95&maxTok=100 HTTP/1.1" 500 -
Changing those ints to floats (and restarting the runtime) was enough for it to work successfully end-to-end 🎉
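For reference, the fix amounts to parsing the sampling parameters as floats before handing them to vLLM. A rough sketch of what the corrected handler could look like, assuming `app` and `llm` are the Flask app and vLLM `LLM` instance created earlier in the notebook (route and parameter names taken from the request log above, defaults are illustrative):

```python
# Sketch of the corrected /respond handler: temp and topP are floats, not ints.
from flask import request
from vllm import SamplingParams

@app.route("/respond")
def respond():
    prompt = request.args.get("input", "")
    temp = float(request.args.get("temp", 0.8))      # was int(...), which broke on '0.8'
    top_p = float(request.args.get("topP", 0.95))
    max_tok = int(request.args.get("maxTok", 100))   # the token count really is an int

    params = SamplingParams(temperature=temp, top_p=top_p, max_tokens=max_tok)
    return llm.generate(prompt, params)[0].outputs[0].text
```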
@gabrielgrant thanks, I've incorporated all the changes, it works again!
Ideally it would be good to figure out why it's blowing up with the latest vllm rather than pinning to an old version. Do you have any insights?
Not really; it's pure Llama, so vLLM should be fine with it. But future versions of this will also need to move to ChatML, which I believe the newer fine-tunes have adopted.
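For context, the demo currently builds prompts in the `### Human: ... ### Assistant:` style visible in the request log above, while ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers. A rough sketch of a ChatML-style prompt builder (the system message here is a placeholder; the exact template should come from the newer fine-tunes' model cards):

```python
# Rough sketch of a ChatML-style prompt for newer chat fine-tunes.
# The system message is a placeholder; check the model card for the exact template.
def build_chatml_prompt(user_message: str,
                        system: str = "You are a helpful assistant.") -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("tell me a joke"))
```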
Saw this PR only removed the old HF chat link. Would it make sense to add a link to this Colab to the README?