Junyang Lin

Results: 173 comments by Junyang Lin

The scores are the processed logits; I think you should directly get `output["logits"]` instead. Check if it works, and see https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py for a better understanding.
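
A minimal sketch of the difference, assuming a recent transformers version (the model name is just an example): raw logits come from a plain forward pass, while the `scores` returned by `generate()` have the logits processors already applied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen1.5-14B-Chat"  # example model; any causal LM behaves the same
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto")

inputs = tokenizer("Hello", return_tensors="pt")

# Raw, unprocessed logits: plain forward pass.
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# `scores` from generate() are post-processed (temperature, top-p, ...).
out = model.generate(
    **inputs, max_new_tokens=8,
    return_dict_in_generate=True, output_scores=True,
)
processed = out.scores  # tuple of (batch, vocab_size) tensors, one per step
```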

For the installation of auto-gptq, we advise you to install it from source (git clone the repo and run `pip install -e .`); otherwise you may run into the "CUDA not installed" issue.
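
For example (the repo URL is the AutoGPTQ home as I know it; adjust if yours differs):

```bash
# Build auto-gptq from source so the CUDA extension is compiled locally.
git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ
pip install -e .  # editable install; builds the CUDA kernels against your toolkit
```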

Stay tuned for our upcoming tech report. For now we are not releasing details about this.

Models without `-chat` in their names are not meant for chatting. In fact, the base models are usually intended for finetuning. They are trained purely by next-token prediction on large-scale...
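
To illustrate, a minimal sketch (model name is an example): a base model simply continues text via next-token prediction, so you prompt it with plain text rather than a chat dialogue.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen1.5-14B"  # note: no `-Chat` suffix, this is the base model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

# The base model just continues the prompt; it will not follow instructions.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```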

I suspect you are using the base model instead of the chat model. Use Qwen1.5-14B-Chat and follow the example code.
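
For reference, a sketch of the usual chat-model usage with `apply_chat_template`; the generation settings here are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen1.5-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
# Render the dialogue with the model's chat template before generating.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
response = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```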

Next week I'll provide instructions. You can take a look at `model.chat()` in our previous Qwen code and see if you can do it yourself.
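
If you want to try before then, here is a rough sketch of what a `model.chat()`-style helper can look like on top of `apply_chat_template`; the history format (a list of (query, response) pairs) is an assumption, not the exact original implementation.

```python
# Hypothetical re-implementation of a chat() helper, not the original code.
def chat(model, tokenizer, query, history=None, max_new_tokens=512):
    history = history or []
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    # Replay previous turns so the model sees the full conversation.
    for q, a in history:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": query})

    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    response = tokenizer.decode(
        out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return response, history + [(query, response)]
```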

> For multiple GPUs you may need to add the `--tensor-parallel-size` argument. With it I no longer get the OOM error, but I hit other CUDA errors.

Yeah, you need this for tensor parallelism to deploy the large model across multiple devices.
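
For example, with vLLM's Python API (the model name and GPU count are illustrative; the `--tensor-parallel-size` CLI flag maps to the same parameter when serving):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across that many GPUs (4 is an example).
llm = LLM(model="Qwen/Qwen1.5-72B-Chat", tensor_parallel_size=4)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```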

Sorry, for now we are not going to release the details. Stay tuned for the coming tech report.

It is about the quantization level: you can regard q6 as 6-bit quantization and q2 as 2-bit quantization. For sure, fp16 / bf16 should perform...
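
As a back-of-envelope illustration (ignoring GGUF metadata and per-block scale overhead, so the real files are slightly larger), the bit width translates roughly into weight memory like this:

```python
# Rough approximation only: parameters * bits / 8 bytes, for a 14B model.
def approx_size_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for name, bits in [("fp16", 16), ("q6", 6), ("q2", 2)]:
    print(f"{name}: ~{approx_size_gb(14e9, bits):.1f} GB")
# fp16: ~28.0 GB, q6: ~10.5 GB, q2: ~3.5 GB
```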