lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Same as https://github.com/vllm-project/vllm/issues/182#issuecomment-1627176207
When you install triton 2.0.0.dev20221202, find compiler.py in ****/python3.9/site-packages/triton/ and change L998 - L1018 to
Hello, deployment and inference work fine when I use a single A800, but multi-GPU inference raises an error: `Task exception was never retrieved future: Traceback (most recent call last): File "/opt/conda/lib/python3.9/site-packages/rpyc/core/stream.py", line 268, in read buf = self.sock.recv(min(self.MAX_IO_CHUNK, count)) ConnectionResetError: [Errno 104] Connection reset by peer...
# LightLLM run-through: reproducing the kvoff branch

##### Step 1: create the docker container

Pull the image: `docker pull ghcr.io/modeltc/lightllm:main`

The llama-7b model is large, and cloning it directly inside the server's docker container kept failing with network interruptions, so I downloaded the model locally, transferred it to the server with Xftp, and then mapped the model folder onto the models folder of the lightllm source tree when creating the container.

Model repository: [huggyllama/llama-7b · Hugging Face](https://huggingface.co/huggyllama/llama-7b)

```
docker run -itd --ipc=host --net=host --name lxn_lightllm --gpus all -p 8080:8080 -v /hdd/lxn/llama-7b:/lightllm/lightllm/models/llama-7b ghcr.io/modeltc/lightllm:main /bin/bash
```
...
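Once the container is up, a quick way to confirm the volume mapping worked is to check the mapped model directory from inside the container before launching the server. A minimal sketch (the paths are the ones from the `docker run` command above and are specific to this setup):

```python
# Sketch: verify that the llama-7b folder mapped via -v is visible inside the container.
import json
import os

model_dir = "/lightllm/lightllm/models/llama-7b"
print("model_dir exists:", os.path.isdir(model_dir))

config_path = os.path.join(model_dir, "config.json")
if os.path.isfile(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    # A llama-7b checkpoint should report model_type "llama" and 32 hidden layers.
    print("model_type:", cfg.get("model_type"))
    print("num_hidden_layers:", cfg.get("num_hidden_layers"))
else:
    print("config.json not found; check the -v volume mapping in docker run")
```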
Error report: on a V100 machine with 32 GB of GPU memory, the server starts normally, but running a single short query raises an out-of-memory error. `python3 -m lightllm.server.api_server --model_dir /app/baichuan2-13B --trust_remote_code --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 6000` Using a slow tokenizer. This might cause a significant slowdown. Consider...
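For context, a back-of-the-envelope estimate suggests why a 13B model with `--max_total_token_num 6000` can push a 32 GB card over the edge. This is only a sketch, assuming fp16 weights and roughly 40 layers with hidden size 5120 for Baichuan2-13B; it is not an exact accounting of lightllm's allocator:

```python
# Rough memory arithmetic (assumed shapes: ~13e9 params, 40 layers, hidden 5120, fp16).
params = 13e9
bytes_per_value = 2                      # fp16
weights_gb = params * bytes_per_value / 1024**3

layers, hidden = 40, 5120
kv_bytes_per_token = layers * 2 * hidden * bytes_per_value   # K and V per layer
max_total_token_num = 6000
kv_cache_gb = max_total_token_num * kv_bytes_per_token / 1024**3

print(f"weights  ~= {weights_gb:.1f} GB")    # ~24 GB
print(f"kv cache ~= {kv_cache_gb:.1f} GB")   # ~4.6 GB
print(f"total before activations ~= {weights_gb + kv_cache_gb:.1f} GB on a 32 GB V100")
```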
1. Running `python -m lightllm.server.api_server --model_dir baichuan-13b --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 4096 --trust_remote_code` succeeds, and the log shows: INFO: Started server process [560] INFO: Waiting for application...
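Once the server reports it has started, a quick smoke test is to hit the `/generate` endpoint. A minimal sketch using `requests`; the endpoint and payload shape follow the example in the LightLLM README, and the prompt and sampling parameters here are arbitrary:

```python
# Smoke test against a running lightllm api_server; adjust host/port to your launch flags.
import requests

url = "http://localhost:8080/generate"
payload = {
    "inputs": "What is AI?",
    "parameters": {"max_new_tokens": 17},
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.status_code)
print(resp.json())
```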
Thanks for the project! We want to run lightllm directly in a cloud container environment, where providing a local `model_dir` is harder than providing a huggingface model...
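One workaround, until a Hugging Face model id is accepted directly, is to download the snapshot at container start and pass the resulting directory as `--model_dir`. A minimal sketch using `huggingface_hub`; the repo id and local path are only examples:

```python
# Sketch: fetch a model snapshot from the Hugging Face Hub into a local directory,
# then point lightllm's --model_dir at it.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="huggyllama/llama-7b",       # example repo id
    local_dir="/models/llama-7b",        # hypothetical container path
)
print("pass this to --model_dir:", local_dir)
# e.g.  python -m lightllm.server.api_server --model_dir /models/llama-7b --tp 1 ...
```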
I have a question from reading the code. I notice that in `~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py`, flash attention is only applied in the prefill stage, i.e. `context_attention_fwd`, but not to...
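To make the distinction concrete, here is a small PyTorch sketch (single head, no batching; not lightllm's actual Triton kernels) of why the two stages are usually handled by different kernels: prefill computes a full causal seq_len x seq_len attention over the prompt, while each decode step is a single new query token attending over the KV cache:

```python
# Illustrative only; shapes are arbitrary.
import torch

head_dim, n_ctx = 64, 128
k_cache = torch.randn(n_ctx, head_dim)
v_cache = torch.randn(n_ctx, head_dim)

# Prefill (context_attention_fwd-like): all prompt tokens are queries at once,
# so scores are (n_ctx, n_ctx) with a causal mask -- a flash-attention-style kernel pays off here.
q_prefill = torch.randn(n_ctx, head_dim)
scores = (q_prefill @ k_cache.T) / head_dim ** 0.5
causal_mask = torch.ones(n_ctx, n_ctx).triu(1).bool()
prefill_out = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1) @ v_cache

# Decode: one new query token per step, so scores are just (1, n_ctx) over the KV cache --
# a memory-bound lookup, which is why a separate decode (token) attention path is used.
q_decode = torch.randn(1, head_dim)
decode_scores = (q_decode @ k_cache.T) / head_dim ** 0.5
decode_out = torch.softmax(decode_scores, dim=-1) @ v_cache

print(prefill_out.shape, decode_out.shape)   # torch.Size([128, 64]) torch.Size([1, 64])
```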
**Issue description:** The current implementation seems to differ from the standard OpenAI output:
1. finish_reason is always null, even once the last token has been generated; normally it should be "stop" or "length" (see the sketch below).
2. index is always 0.
3. The stop parameter is not supported yet: "The stop parameter is not currently supported".
4. With --eos_id 151645 set when starting the server, generation does terminate there, but the eos token is still returned in the output; normally that token should not be returned.

**Steps to reproduce:** Example request: { "model": "Qwen", "messages": [ { "role": "user",...
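For reference, a rough sketch of the shape an OpenAI-style chat completion response is expected to have; the values here are illustrative and not lightllm's actual output:

```python
# The point is that "finish_reason" should be "stop" (eos/stop string hit) or "length"
# (max tokens reached), and the eos token text should not appear in "content".
expected_response = {
    "id": "chatcmpl-example",            # illustrative id
    "object": "chat.completion",
    "model": "Qwen",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "..."},
            "finish_reason": "stop",     # or "length" when max tokens is reached
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30},
}
print(expected_response["choices"][0]["finish_reason"])
```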