fastllm
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models can reach 10000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.
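As a rough illustration of the Python bindings mentioned above, a minimal usage sketch might look like the following. The `fastllm_pytools` package name and the `llm.model` / `response` calls are assumptions inferred from the issues below, and the `.flm` file path is hypothetical:

```Python
# Minimal sketch: load a converted .flm model through the Python bindings
# and generate one reply. Names are assumptions, not a verified API reference.
from fastllm_pytools import llm

model = llm.model("chatglm-6b-int4.flm")   # path to a pre-converted model file (hypothetical)
print(model.response("Hello, who are you?"))
```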
Can't it be built directly with msys2 or mingw64?
I currently have multiple graphics cards and want to run a model on each card, so I need to place each model on a specified graphics card. So,...
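One framework-agnostic way to pin each model to its own card is to run one process per GPU and restrict the visible devices with the `CUDA_VISIBLE_DEVICES` environment variable before CUDA initializes. This is a generic workaround sketch, not fastllm's own device-selection API; the `fastllm_pytools` import and `llm.model(...)` call are carried over from the other issues as assumptions:

```Python
# Generic workaround sketch: one process per GPU, selected via CUDA_VISIBLE_DEVICES.
import os
import sys

gpu_id = sys.argv[1] if len(sys.argv) > 1 else "0"
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id    # must be set before CUDA initializes

from fastllm_pytools import llm                # import after setting the variable (assumed package name)

model = llm.model("model.flm")                 # this process only sees the chosen GPU
print(model.response("Hello"))
```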
```
In file included from /usr/local/include/c++/10.1.0/cstdint:35,
                 from /opt/module/fastllm-master/include/fastllm.h:9,
                 from /opt/module/fastllm-master/include/devices/cuda/fastllm-cuda.cuh:1,
                 from /opt/module/fastllm-master/src/devices/cuda/fastllm-cuda.cu:9:
/usr/local/include/c++/10.1.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must...
```
Baichuan model conversion issue
ValueError: Can't find 'adapter_config.json' at 'hiyouga/baichuan-7b-sft'
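This error usually means a repo was loaded as a PEFT/LoRA adapter but no `adapter_config.json` exists at the given path. As a rough sketch of how such an adapter is typically attached to a base model and merged before any conversion (assuming `hiyouga/baichuan-7b-sft` is a LoRA adapter on top of a Baichuan-7B base; repo names and the merge step are illustrative, not a verified recipe):

```Python
# Illustrative sketch only: attach a LoRA adapter with peft and merge it
# into the base model so the merged weights can be converted afterwards.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)

model = PeftModel.from_pretrained(base, "hiyouga/baichuan-7b-sft")  # needs adapter_config.json here
model = model.merge_and_unload()  # fold the LoRA weights back into the base model
```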
Thanks to the author for open-sourcing this valuable work. In my tests, float16, int8, and int4 show no obvious speed difference, all roughly 13 ms/token to 15 ms/token, which falls short of the reported 176 tokens/s (equivalent to 5.68 ms/token). My hardware: CUDA 11.8, A100. Test code below; dtype can be adjusted to "float16", "int8", or "int4":

```Python
from transformers import AutoTokenizer, AutoModel
from fastllm_pytools import llm  # fastllm Python bindings

tokenizer = AutoTokenizer.from_pretrained("chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("chatglm-6b", trust_remote_code=True)
model = llm.from_hf(model, tokenizer, dtype="int4")  # can be changed to "float16", "int8",...
```
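For reference, 176 tokens/s corresponds to roughly 1000 / 176 ≈ 5.7 ms per token. One way to measure per-token latency is to time a full generation and divide by the number of generated tokens; the sketch below reuses the `tokenizer` and converted `model` from the snippet above and assumes a `response` method that returns the full generated string (an assumption, not a confirmed API):

```Python
# Rough timing sketch: wall-clock time for one reply divided by generated tokens.
import time

prompt = "Hello, please introduce yourself."
start = time.time()
output = model.response(prompt)        # assumed to return the full generated text
elapsed = time.time() - start

n_tokens = len(tokenizer.encode(output))  # count tokens with the HF tokenizer from above
print(f"{elapsed * 1000 / n_tokens:.2f} ms/token  ({n_tokens / elapsed:.1f} tokens/s)")
```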
When calling from Python, is the way to specify the GPU id llm.model("model.flm").cuda(device_id)?