lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Hi, thanks for your great work. Are there any plans to support models like GPT-Neo and GPT-NeoX?
Can multiple model instances be loaded on one 3090 (24 GB) GPU, like Triton Inference Server does?
How can this problem be solved? `self.value_buffer = [torch.empty((size, head_num, head_dim), dtype=dtype, device="cuda") for _ in range(layer_num)]` fails with `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 79.35...`
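For context, the footprint of those per-layer buffers can be estimated up front. A minimal sketch of the arithmetic, with made-up shape values since the issue does not include the actual model config:

```python
import torch

# Hypothetical config -- the issue does not state the real values.
size, head_num, head_dim, layer_num = 76800, 64, 128, 80
dtype = torch.float16

# Bytes per element for the chosen dtype (2 for fp16).
elem_bytes = torch.tensor([], dtype=dtype).element_size()

# One value buffer per layer, plus a matching key buffer of the same shape.
per_layer = size * head_num * head_dim * elem_bytes
total = per_layer * layer_num * 2

print(f"per-layer buffer: {per_layer / 1024**3:.2f} GiB")  # ~1.17 GiB
print(f"total KV cache:   {total / 1024**3:.2f} GiB")      # ~187.50 GiB
```

With numbers like these, the full cache needs far more than the 79.35 GiB the GPU reports, so the usual fix is to lower the preallocated token capacity (`size`, driven by LightLLM's `--max_total_token_num` flag) rather than to free memory elsewhere.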
Error when calling
`/usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda /usr/bin/ld: cannot find -lcuda: No such file or directory /usr/bin/ld: skipping incompatible /usr/lib32/libcuda.so when searching for -lcuda collect2: error: ld returned 1...`
As the title mentions.
requirements.txt pins torch 2.0.0, which conflicts with triton 2.1.0 at install time. Workaround used: install triton 2.0.0 during setup, then separately upgrade triton to 2.1.0 afterwards. The server starts normally, but requests then fail with: > /root/.triton/llvm/llvm+mlir-17.0.0-x86_64-linux-gnu-centos-7-release/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From&) [with To = mlir::triton::gpu::BlockedEncodingAttr; From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed. Base environment: Red Hat 7, ...
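Before digging into the MLIR assertion itself, it may be worth confirming which torch/triton pair is actually live in the environment after the two-step install described above; a trivial runtime check:

```python
# Print the versions the server process would actually import.
# torch 2.0.0 pins triton 2.0.0 as a dependency, so a separately
# upgraded triton 2.1.0 can disagree with what torch expects.
import torch
import triton

print("torch :", torch.__version__)
print("triton:", triton.__version__)
print("cuda  :", torch.version.cuda)
```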
[context_flashattention_nopad_fp16_fp8.txt](https://github.com/user-attachments/files/16421521/context_flashattention_nopad_fp16_fp8.txt) We have implemented an FP8 version of context_flashattention_nopad.py. The V shape needs to be changed for the performance improvement described in https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html. However, the current result is not correct; could...
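For reference, the V-shape change the linked Triton tutorial describes amounts to storing V transposed so the second matmul (P · V) sees an FP8-friendly memory layout. A PyTorch-level sketch of that transform, with illustrative shapes only (this is not the attached kernel's actual code):

```python
import torch

# Illustrative shapes only -- not LightLLM's real KV layout.
batch, heads, seq_len, head_dim = 1, 8, 1024, 64
v = torch.randn(batch, heads, seq_len, head_dim, dtype=torch.float16)

# The fp16 kernel computes p @ v with v laid out (seq_len, head_dim) per head.
# The tutorial's fp8 path instead stores v transposed in memory, so the second
# tl.dot gets an operand layout that fp8 tensor-core instructions accept.
v_fp8_layout = v.transpose(-2, -1).contiguous()  # (batch, heads, head_dim, seq_len)
print(tuple(v.shape), "->", tuple(v_fp8_layout.shape))
```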