aresnow1
We added grammar input support for ggml models in https://github.com/xorbitsai/inference/pull/525. Are you interested in implementing this API?
Are you sure you selected chat when registering? It looks like the model only has the generate ability, not the chat ability.
Try changing roles to ["Human", "Assistant"] @faroasis
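For reference, here is a hedged sketch of where the roles would go in a custom model's registration JSON; the surrounding field names (model_name, prompt_style, style_name) are assumptions based on typical Xinference custom-model files, not taken from this thread:

```json
{
  "version": 1,
  "model_name": "my-llama2-chinese-chat",
  "model_ability": ["chat"],
  "prompt_style": {
    "style_name": "ADD_COLON_SINGLE",
    "system_prompt": "",
    "roles": ["Human", "Assistant"]
  }
}
```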
Thanks for sharing. If you're willing, could you open a PR to help fix it?

> The problem isn't there. The LlamaTokenizer vocabulary is too small, so Chinese text gets split across tokens. Decoding the whole output_ids yields complete Chinese characters, but the current approach of emitting output every stream_interval chops Chinese characters apart.
> LlamaTokenizer(name_or_path='C:\llama2\cn_chat', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("",...
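The quoted issue can be reproduced without any model: a Chinese character is 3 bytes in UTF-8, so flushing a stream mid-character (analogous to decoding a partial span of output_ids every stream_interval) yields fragments that are not valid UTF-8 on their own, while decoding the accumulated buffer is always correct. A minimal sketch:

```python
# Why per-interval decoding can emit broken Chinese characters:
# each character below is 3 bytes in UTF-8, so a flush boundary
# can land in the middle of a character.
text = "你好"
data = text.encode("utf-8")  # 6 bytes total, 3 per character

# Flushing after every 2 bytes splits each character across chunks.
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

# Each chunk decoded alone produces replacement characters.
broken = [c.decode("utf-8", errors="replace") for c in chunks]
print(broken)

# Decoding the accumulated buffer recovers the intact text, which is
# why decoding the full output_ids gives complete Chinese characters.
print(b"".join(chunks).decode("utf-8"))  # -> 你好
```

A common fix is to hold back incomplete trailing bytes (or to re-decode from the last safe offset) before emitting each streamed chunk.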
We need to test it; some adaptation work may be required.
It seems vLLM's support for GPTQ is still in progress (https://github.com/vllm-project/vllm/pull/1580). How did you use vLLM with GPTQ?
It needs some changes to support passing the quantization method; I'll create a PR to support this later.
Could you start the service with `xinference --log-level=debug` and run it again? Please paste the full logs so we can get more information.
Run `pip show xinference` and `pip show chatglm_cpp` to check these packages' versions.