The Qwen-14B-Chat-Int4 GPTQ model is not accelerated when served with vLLM. Use the following vllm_wrapper instead: https://github.com/QwenLM/vllm-gptq/blob/main/tests/qwen/vllm_wrapper.py
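A minimal sketch of how that wrapper is typically used, following the usage example in Qwen's documentation; the `vLLMWrapper` constructor arguments and the `chat()` signature are assumptions taken from that example rather than verified against the linked file:

```python
# Sketch: serving the GPTQ checkpoint through the vllm_wrapper from the
# vllm-gptq repo. Names and arguments follow Qwen's docs (assumptions here).
from vllm_wrapper import vLLMWrapper

model = vLLMWrapper(
    "Qwen/Qwen-14B-Chat-Int4",  # GPTQ-quantized checkpoint
    quantization="gptq",
    dtype="float16",
    tensor_parallel_size=1,
)

response, history = model.chat(query="你好", history=None)
print(response)
```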
**nllb 3.3B** translating from Chinese to Korean produced only commas: `, , , , , , , , , , , , , , ,` **source_text**: 哈哈哈哈哈 说的什么玩意儿呀 这个声音太尖了 耳朵有点受不了哎 (roughly: "Hahaha, what on earth are you saying? That voice is so shrill my ears can hardly stand it.") **translated_text**: ,...
**nllb 3.3B** translating from Chinese to Japanese produced only empty parentheses: () () () () () () () () () () () () () () () () () () () () () ()...
Why does NLLB 3.3B still occupy 5.7 GB of host (CPU) memory after the model has been loaded onto the GPU and occupies 13.17 GB of GPU memory? In my opinion, once the model has been loaded onto the GPU, the...
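A sketch of one common mitigation, assuming the Hugging Face checkpoint name facebook/nllb-200-3.3B: after moving the weights to the GPU, force a garbage collection so the now-unreferenced host copy can be freed (how much memory the allocator actually returns to the OS is implementation-dependent):

```python
import gc
import torch
from transformers import AutoModelForSeq2SeqLM

# Load on CPU, then move the weights to the GPU.
# Alternatively, from_pretrained(..., low_cpu_mem_usage=True) avoids
# materializing a full host-side copy in the first place.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-3.3B")
model = model.to("cuda")

# The CPU-side tensors are no longer referenced after .to("cuda");
# collect so Python can free them. Note the process RSS may still stay
# high, since glibc's allocator does not always return pages to the OS.
gc.collect()
torch.cuda.empty_cache()

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
```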
How can we accelerate NLLB? Can we use TensorRT or vLLM?
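Before reaching for TensorRT or vLLM, a cheap baseline is fp16 inference with batched generation. A minimal sketch, assuming the facebook/nllb-200-3.3B checkpoint and the FLORES-200 language codes (zho_Hans → kor_Hang):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(name, src_lang="zho_Hans")
model = AutoModelForSeq2SeqLM.from_pretrained(
    name, torch_dtype=torch.float16  # halves memory and speeds up GPU matmuls
).to("cuda")

batch = tokenizer(["哈哈哈哈哈", "你好,世界"], return_tensors="pt", padding=True).to("cuda")
out = model.generate(
    **batch,
    # NLLB selects the target language by forcing its code as the first token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kor_Hang"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```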
Hello, I am using the chat model RWKV-4-Raven-7B-v12-Eng49%-Chn49%-Jpn1%-Other1%-20230530-ctx8192.pth with API_DEMO.py: resp = pipeline.generate(prompt, token_count=100, args=args, callback=None). The problem I am running into is that the reply gets cut off instead of stopping naturally. I would like the reply to stay within 100 tokens. How can I solve this? The last reply ends at "我有", which is clearly truncated. My input is: Here is some information about 兔仔:\n---\nName: 兔仔. Gender: male. Age: 18. Role: friend of the same age, a close buddy. Background: high-school student. Appearance: 1.68 m tall, likes wearing brand-name sportswear and sneakers. History: a classmate of the user, a childhood friend who grew up with him, attends the same school, and shares the same hobbies\nPersonality: innocent and cheerful, quick and adaptable, lives in the moment, enjoys life. Likes: socializing, judging things by intuition, adapts easily. Dislikes: anything planned out in advance.\n---\n\nI want you to act like 兔仔.\nYou are now cosplay 兔仔.\nI...
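A sketch of one way to let generation stop naturally instead of being hard-cut at token_count, using the token_stop field of PIPELINE_ARGS in the rwkv pip package; the stop-token id and tokenizer file are assumptions for the Raven models:

```python
from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

model = RWKV(
    # Path without the .pth suffix; the package resolves the extension.
    model="RWKV-4-Raven-7B-v12-Eng49%-Chn49%-Jpn1%-Other1%-20230530-ctx8192",
    strategy="cuda fp16",
)
pipeline = PIPELINE(model, "20B_tokenizer.json")  # Raven models use the 20B tokenizer

args = PIPELINE_ARGS(
    temperature=1.0,
    top_p=0.7,
    # Assumption: 0 is <|endoftext|>. For chat turns, also adding the id of
    # "\n\n" makes the model stop at the end of its own reply.
    token_stop=[0],
)

prompt = "..."  # the character prompt from the issue
resp = pipeline.generate(prompt, token_count=100, args=args, callback=None)
```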
**Describe the bug** The Qwen-14B-Chat-Int4 GPTQ model is much slower than the original Qwen-14B-Chat model. **Hardware details** A100 80G **Software version** Versions of relevant software such as the operating system, CUDA toolkit, Python,...
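A minimal way to quantify "much slower" is to compare decode throughput (generated tokens per second) on the same prompt. A sketch, assuming both checkpoints load through transformers with trust_remote_code=True as in the Qwen docs:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(name: str, prompt: str = "给我讲一个故事。", new_tokens: int = 128) -> float:
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name, device_map="cuda", trust_remote_code=True
    ).eval()
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    generated = out.shape[1] - inputs["input_ids"].shape[1]
    return generated / (time.perf_counter() - start)

for name in ("Qwen/Qwen-14B-Chat", "Qwen/Qwen-14B-Chat-Int4"):
    print(name, f"{tokens_per_second(name):.1f} tok/s")
```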
After LoRA fine-tuning the Qwen-14B-Chat 14B model and converting it to a GPTQ-quantized model, running it with vLLM returns an empty string with about 5% probability. The unquantized model does not have this problem. I quantized with the script from the README. Thanks.
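For context, a sketch of the GPTQ conversion step with AutoGPTQ, which is the approach Qwen's README takes; the merged-model path and calibration text are placeholders, and a calibration set drawn from the fine-tuning data is usually recommended over a single sample:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

merged_dir = "qwen-14b-chat-lora-merged"  # placeholder: LoRA weights already merged
tokenizer = AutoTokenizer.from_pretrained(merged_dir, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, damp_percent=0.01)
model = AutoGPTQForCausalLM.from_pretrained(
    merged_dir, quantize_config, trust_remote_code=True
)

# Calibration examples: tokenized samples resembling the fine-tuning data
# give better 4-bit weights than generic text.
examples = [tokenizer("你好,请介绍一下你自己。", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("qwen-14b-chat-lora-gptq", use_safetensors=True)
```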
### System Info
Nvidia A100 PCIe 80G
### Who can help?
none
### Information
- [X] The official example scripts
- [X] My own modified scripts
### Tasks
- [X] ...
```
[ 98%] Built target flags_parse
[ 99%] Linking CXX static library lib/libqwen.a
[ 99%] Built target qwen
[ 99%] Building CXX object CMakeFiles/main.dir/main.cpp.o
[ 99%] Building CXX object CMakeFiles/_C.dir/qwen_pybind.cpp.o
In...
```