fastllm
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.
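As a rough illustration of the Python bindings mentioned above, a minimal usage sketch might look like the following. The `fastllm_pytools` package name and the `llm.model` / `response` calls are assumptions inferred from the issues below, and the `.flm` file path is hypothetical:

```Python
# Minimal sketch: load a converted .flm model through the Python bindings
# and generate one reply. Names are assumptions, not a verified API reference.
from fastllm_pytools import llm

model = llm.model("chatglm-6b-int4.flm")   # path to a pre-converted model file (hypothetical)
print(model.response("Hello, who are you?"))
```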
Can't it be built directly with msys2 or mingw64?
I currently have multiple graphics cards and want to run a model on each card, so I need to place each model on a specified graphics card. So,...
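One framework-agnostic way to pin each model to its own card is to run one process per GPU and restrict the visible devices with the `CUDA_VISIBLE_DEVICES` environment variable before CUDA initializes. This is a generic workaround sketch, not fastllm's own device-selection API; the `fastllm_pytools` import and `llm.model(...)` call are carried over from the other issues as assumptions:

```Python
# Generic workaround sketch: one process per GPU, selected via CUDA_VISIBLE_DEVICES.
import os
import sys

gpu_id = sys.argv[1] if len(sys.argv) > 1 else "0"
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id    # must be set before CUDA initializes

from fastllm_pytools import llm                # import after setting the variable (assumed package name)

model = llm.model("model.flm")                 # this process only sees the chosen GPU
print(model.response("Hello"))
```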
```
In file included from /usr/local/include/c++/10.1.0/cstdint:35,
                 from /opt/module/fastllm-master/include/fastllm.h:9,
                 from /opt/module/fastllm-master/include/devices/cuda/fastllm-cuda.cuh:1,
                 from /opt/module/fastllm-master/src/devices/cuda/fastllm-cuda.cu:9:
/usr/local/include/c++/10.1.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must...
```
Baichuan model conversion issue
ValueError: Can't find 'adapter_config.json' at 'hiyouga/baichuan-7b-sft'
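This error usually means a repo was loaded as a PEFT/LoRA adapter but no `adapter_config.json` exists at the given path. As a rough sketch of how such an adapter is typically attached to a base model and merged before any conversion (assuming `hiyouga/baichuan-7b-sft` is a LoRA adapter on top of a Baichuan-7B base; repo names and the merge step are illustrative, not a verified recipe):

```Python
# Illustrative sketch only: attach a LoRA adapter with peft and merge it
# into the base model so the merged weights can be converted afterwards.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)

model = PeftModel.from_pretrained(base, "hiyouga/baichuan-7b-sft")  # needs adapter_config.json here
model = model.merge_and_unload()  # fold the LoRA weights back into the base model
```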
Thanks to the author for open-sourcing this valuable work. In my tests, float16, int8, and int4 show no obvious speed difference, all roughly 13 ms/token to 15 ms/token, which falls short of the reported 176 tokens/s (equivalent to 5.68 ms/token). My hardware: CUDA 11.8, A100. Test code below; dtype can be adjusted to "float16", "int8", or "int4":

```Python
from transformers import AutoTokenizer, AutoModel
from fastllm_pytools import llm  # fastllm Python bindings

tokenizer = AutoTokenizer.from_pretrained("chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("chatglm-6b", trust_remote_code=True)
model = llm.from_hf(model, tokenizer, dtype="int4")  # can be changed to "float16", "int8",...
```
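For reference, 176 tokens/s corresponds to roughly 1000 / 176 ≈ 5.7 ms per token. One way to measure per-token latency is to time a full generation and divide by the number of generated tokens; the sketch below reuses the `tokenizer` and converted `model` from the snippet above and assumes a `response` method that returns the full generated string (an assumption, not a confirmed API):

```Python
# Rough timing sketch: wall-clock time for one reply divided by generated tokens.
import time

prompt = "Hello, please introduce yourself."
start = time.time()
output = model.response(prompt)        # assumed to return the full generated text
elapsed = time.time() - start

n_tokens = len(tokenizer.encode(output))  # count tokens with the HF tokenizer from above
print(f"{elapsed * 1000 / n_tokens:.2f} ms/token  ({n_tokens / elapsed:.1f} tokens/s)")
```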
When calling from Python, is the way to specify the GPU id llm.model("model.flm").cuda(device_id)?