
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Results: 125 lightllm issues

How to use 8bit quantized models? Can I run GGML/GGUF models?

The solution I currently use is `pkill -9 -f lightllm.server.api_server` followed by `fuser -k /dev/nvidia0`, and as you can see it also kills other processes (and my boss will kill me), so please...
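
A less blunt workaround (a sketch only, not part of lightllm's tooling) is to match the server's command line instead of killing everything that holds the GPU device; the `psutil` usage and the match string below are assumptions for illustration:

```python
# Sketch: terminate only processes whose command line mentions the lightllm
# API server, instead of fuser-killing everything attached to /dev/nvidia0.
# psutil is a third-party dependency; the pattern string is an assumption.
import psutil

def kill_lightllm_server(pattern: str = "lightllm.server.api_server") -> None:
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if pattern in cmdline:
            try:
                print(f"terminating pid {proc.info['pid']}: {cmdline}")
                proc.terminate()  # SIGTERM first; escalate to proc.kill() if needed
            except psutil.Error:
                pass  # process already gone or not ours to signal

if __name__ == "__main__":
    kill_lightllm_server()
```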

Below are my test results on an A100-SXM-80G. vLLM: `python -m vllm.entrypoints.api_server --model /code/llama-65b-hf --swap-space 16 --disable-log-requests --tensor-parallel-size 8`, then `python benchmarks/benchmark_serving.py --tokenizer /code/llama-65b-hf --dataset /code/ShareGPT_V3_unfiltered_cleaned_split.json`. Total time: 312.02 s; Throughput: 3.20 requests/s; Average latency: 125.45 s; Average...
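
As a quick sanity check on the reported numbers, assuming the ShareGPT benchmark script sent its usual 1000 prompts (an assumption, since the count is not quoted above):

```python
# Sanity check of the reported vLLM figures, assuming 1000 benchmark requests.
num_requests = 1000
total_time_s = 312.02
throughput = num_requests / total_time_s
print(f"throughput ~ {throughput:.2f} requests/s")  # ~3.20, matching the report
# Average latency (125.45 s) is per-request wall time under heavy queuing, so
# it can greatly exceed total_time / num_requests when requests overlap.
```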

Any plans to support https://github.com/deepseek-ai/deepseek-coder/ in the near future?

bug

When I tried to add some stop words for the model, I found the stop_sequences parameter in lightllm/server/sampling_params.py. There seems to be an issue with line 67: `if stop_str_ids is not None and...`
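
For context, the sketch below shows the behavior that token-level stop sequences are meant to provide; it is illustrative only and is not the actual code in lightllm/server/sampling_params.py:

```python
# Illustrative stop-sequence check: stop generation once the generated token
# ids end with any of the tokenized stop sequences.
from typing import List, Optional

def hits_stop_sequence(generated_ids: List[int],
                       stop_sequences: Optional[List[List[int]]]) -> bool:
    """Return True if generated_ids ends with any tokenized stop sequence."""
    if not stop_sequences:
        return False
    for stop_ids in stop_sequences:
        if stop_ids and generated_ids[-len(stop_ids):] == stop_ids:
            return True
    return False

# Example: stop once the (made-up) ids [13, 13] appear at the end.
assert hits_stop_sequence([5, 8, 13, 13], [[13, 13]])
```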

bug

Background: when TGI is adapted to lightllm and a model is loaded across multiple GPUs, one process is spawned per GPU, and each process loads the entire model into host memory. For large checkpoints, e.g. 65B and above, loading on 8 GPUs then needs 8 * 130 GB of RAM, which is clearly unreasonable and leads to OOM. Proposed solution: lightllm could provide a load_from_weight_dict(weight_dict) interface; the TGI layer would pass in the weight dict and release host memory as each weight is loaded, which would solve the problem.
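
load_from_weight_dict is a proposed interface here, not an existing lightllm API. The PyTorch sketch below only illustrates the load-and-release idea: each host tensor is dropped as soon as it has been copied to the GPU, so peak RAM stays far below a full copy of the checkpoint per process.

```python
# Sketch of the "load while releasing" idea behind the proposed
# load_from_weight_dict(weight_dict) interface. Not lightllm code.
import torch

def load_from_weight_dict(model: torch.nn.Module,
                          weight_dict: dict,
                          device: str = "cuda") -> None:
    """Copy weights into `model` one tensor at a time, dropping each host
    tensor as soon as it has been consumed."""
    params = dict(model.named_parameters())
    for name in list(weight_dict.keys()):
        cpu_tensor = weight_dict.pop(name)   # remove the host reference early
        if name in params:
            params[name].data = cpu_tensor.to(device, non_blocking=True)
        del cpu_tensor                       # allow the host copy to be freed
```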

issue: https://github.com/ModelTC/lightllm/issues/277

When specifying 'max new tokens', LightLLM's output consistently matches this maximum value. However, Transformers sometimes adjusts according to the model itself, resulting in outputs shorter than the specified 'max new...
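
The difference usually comes down to whether EOS is honored as a stop condition. A minimal decode-loop sketch (all names are placeholders, not lightllm APIs) shows why outputs always run to max_new_tokens when it is not:

```python
# Minimal decode loop: without the EOS check, exactly max_new_tokens tokens
# are always emitted. `sample_next_token` and `eos_token_id` are placeholders.
def generate(prompt_ids, sample_next_token, eos_token_id,
             max_new_tokens, stop_on_eos=True):
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = sample_next_token(out)
        out.append(next_id)
        if stop_on_eos and next_id == eos_token_id:
            break  # Transformers-style early stop on end-of-sequence
    return out
```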

bug

In my tests, the sqlcoder2 model (based on starcoder) runs faster than on vLLM, but its output differs substantially from the original model. Is this entirely because beam search is not supported?
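
One way to isolate the decoding strategy from the serving backend is to compare greedy and beam-search outputs of the reference Hugging Face model; the sketch below assumes a local model path and prompt and uses only the standard transformers generate API:

```python
# Sketch: compare greedy vs. beam-search outputs of the reference HF model to
# see how much of the quality gap comes from decoding strategy alone.
# Model path and prompt are placeholders; device_map="auto" needs accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("/path/to/sqlcoder2")
model = AutoModelForCausalLM.from_pretrained("/path/to/sqlcoder2", device_map="auto")
inputs = tok("List the top 5 customers by revenue:", return_tensors="pt").to(model.device)

greedy = model.generate(**inputs, max_new_tokens=128, do_sample=False)
beam = model.generate(**inputs, max_new_tokens=128, do_sample=False, num_beams=4)
print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(beam[0], skip_special_tokens=True))
```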

**Before you submit an issue, please search for existing issues to avoid duplicates.** **Issue description:** Please provide a clear and concise description of your issue. **Steps to reproduce:** Please list...

bug