jamesljl
I've encountered the same problem. Same steps as above: no errors while merging and quantizing, but it just hangs when loading the model, before the ">" prompt appears. The model seems to be loaded, though...
> Download the previous version: https://github.com/ggerganov/llama.cpp/releases

The previous version doesn't work either; it can't even load the 8-bit quantized model.
I just reinstalled the Ubuntu VM, cloned the latest version, and re-compiled it. Then it worked. That's weird.
For a single-machine multi-GPU setup, is it enough to just set something like os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"? Or do I also need to wrap the model with nn.DataParallel? A sketch of both is below.
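For reference, here is a minimal sketch of both options, assuming a standard PyTorch setup; the `nn.Linear` model is just a placeholder for the real one. Note that `CUDA_VISIBLE_DEVICES` only controls which GPUs the process can see, while `nn.DataParallel` is what actually splits each batch across them, so the two are complementary rather than alternatives.

```python
import os

# Must be set before torch initializes CUDA; this only restricts GPU
# visibility and does not by itself parallelize anything.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder for the real model

if torch.cuda.device_count() > 1:
    # Replicates the model on each visible GPU and splits every input
    # batch along dim 0 at forward time.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(8, 1024).cuda()
y = model(x)  # the batch of 8 is sharded across the 4 visible GPUs
```

(For serious multi-GPU training, `torch.nn.parallel.DistributedDataParallel` is generally preferred over `nn.DataParallel`, but the above is the smallest change.)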
I ran into the same problem. I'm using the alpaca2-7B 8-bit quantized model (ggml-model-q8_0.bin); it behaves strangely, producing lots of nonsense, and the prompt doesn't seem to work.
Adaptation for the o1 series of models isn't finished yet, let alone o3. The o1 models don't support max_tokens or the system role, and I hear there are also issues with tokens under streaming. It feels like there hasn't been a new release in a long time.
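To illustrate the quirks mentioned above, here is a minimal sketch using the official OpenAI Python SDK; it reflects OpenAI's documented o1 behavior at the time, not this project's code, and the prompt content is made up.

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o1-mini",
    # o1 models reject the max_tokens parameter; the replacement is
    # max_completion_tokens.
    max_completion_tokens=1024,
    # Early o1 models also reject the "system" role, so any system-style
    # instructions have to be folded into the user message.
    messages=[
        {"role": "user", "content": "You are a helpful assistant.\n\nHello!"}
    ],
    # stream=True was initially unsupported on o1, so it is omitted here.
)
print(resp.choices[0].message.content)
```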