
Truncated responses from mistralai_mixtral-8x7b-instruct-v0.1 / llama.cpp

Open jin-eld opened this issue 10 months ago • 6 comments

Describe the bug

I am seeing truncated replies from the mistralai_mixtral-8x7b-instruct-v0.1 model and have not managed to find a clear pattern for when it happens. Sometimes the model stops too early towards the end of a longer reply, but sometimes it stops after just 3-4 lines of text, so the cutoff does not seem to be related to the length of the response.

The model still has the full answer in its context, so asking it to repeat everything it wrote after the truncated line works as a workaround, but it is quite annoying.

I tried tuning the parameters, but I am not sure I got them right; to be honest, I could not see a clear change in behavior. In the model loading tab I have n_ctx set to 32768. In the parameters tab I used the "Divine Intellect" preset and raised max_new_tokens to 1024, but that did not seem to make a difference: answers were still truncated at seemingly random points, some after many lines and some after just a few.
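One thing I still want to check is whether the webui thinks the generation stopped on an EOS token or on the token budget. A rough sketch of what I have in mind, assuming the OpenAI-compatible API is enabled with --api on the default port and that finish_reason is populated for the llama.cpp loader (I have not verified either yet, and the prompt is just an example):

```python
# Hypothetical check, not run yet: ask the webui's OpenAI-compatible endpoint
# for a completion and look at finish_reason ("stop" = the model emitted EOS,
# "length" = the max_tokens budget was hit). Port and path are the defaults
# as I understand them.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={
        "prompt": "[INST] Write a Python function that parses a CSV file. [/INST]",
        "max_tokens": 1024,
        "temperature": 0.7,
    },
    timeout=600,
)
choice = resp.json()["choices"][0]
print(choice["finish_reason"])  # "stop" vs "length"
print(choice["text"])
```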

The truncation does not happen every time, but often enough in a conversation.

Is there any specific info which I could provide to figure out what is happening?

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Grab mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf from TheBloke on Hugging Face, load it with the llama.cpp backend, and instruct it to write some code. You will consistently get answers that are truncated. If asked, the model will repeat the cut-off parts, and they usually get printed in full the second time.
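To take the webui out of the picture, something along these lines should exercise the same path through llama-cpp-python directly (a sketch based on the llama-cpp-python README; I have not run it on this exact setup, and the prompt is just an example):

```python
# Minimal sketch: load the same GGUF with llama-cpp-python directly and check
# whether the completion stops early on its own ("stop") or runs into the
# token budget ("length").
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf",
    n_ctx=32768,
    n_gpu_layers=-1,  # offload all layers; adjust for the MI25s if needed
)

out = llm.create_completion(
    prompt="[INST] Write a Python script that renames all files in a directory. [/INST]",
    max_tokens=1024,
    temperature=0.7,
)

print(out["choices"][0]["finish_reason"])  # "stop" = EOS emitted, "length" = hit max_tokens
print(out["usage"])                        # prompt/completion token counts
print(out["choices"][0]["text"])
```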

Screenshot

No response

Logs

There is no error output except for the usual timing info dumped by llama.cpp. Please let me know if there are additional traces I can enable that would help get to the bottom of this.

System Info

Fedora 40 (pre release Rawhide)
ROCm 6.0.0
GPUs: 2x MI25 Instinct cards (gfx9)
text-generation-webui: commit 1934cb61ef879815644277c01c7295acbae542d8 (from Sun Mar 10 23:39:20 2024 -0300)
pip lists llama_cpp_python as: 0.2.56

jin-eld avatar Mar 26 '24 19:03 jin-eld

I have the same issue.

zhenweiding avatar Mar 27 '24 00:03 zhenweiding

Did you try Q4? As far as I know, Q3/Q5 were broken when they were created.

berkut1 avatar Mar 27 '24 01:03 berkut1

Q4 is OK, but Q8 has the same issue.

zhenweiding avatar Mar 27 '24 01:03 zhenweiding

But Q5 on ollama is OK.

zhenweiding avatar Mar 27 '24 01:03 zhenweiding

Did you try Q4? As far as I know, Q3/Q5 were broken when they were created.

No, not yet. I am on a slow connection with limited traffic, so "trying out" too many models is not always an option :(

Broken - @berkut1, do you mean during conversion to GGUF, during quantization, or at some other step? In other words: where should the issue be filed in this case?

Then again, @zhenweiding says Q5 works fine with ollama, so now I am not sure whether it is a model issue or a backend issue. Perhaps I should ask in llama.cpp as well...

jin-eld avatar Mar 27 '24 10:03 jin-eld

Broken - @berkut1, do you mean during conversion to GGUF, during quantization, or at some other step? In other words: where should the issue be filed in this case?

When it was created, everyone had problems with Q3/Q5. I don't know whether the problem was in the quantization or on the llama.cpp side. If ollama has no issue, then the problem is probably with https://github.com/abetlen/llama-cpp-python. You can also check it with https://github.com/LostRuins/koboldcpp, which is based on llama.cpp too.
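If you do try koboldcpp, a quick check could look roughly like this (written from memory of the KoboldAI-style API it exposes, so double-check the endpoint and field names; the port is koboldcpp's default as far as I remember, and the prompt is just an example):

```python
# Rough sketch: send the same kind of prompt to koboldcpp's generate endpoint.
# If the reply comes back complete here, the problem is more likely in the
# llama-cpp-python binding than in the GGUF itself.
import requests

resp = requests.post(
    "http://127.0.0.1:5001/api/v1/generate",
    json={
        "prompt": "[INST] Write a Python function that parses a CSV file. [/INST]",
        "max_length": 1024,
        "max_context_length": 32768,
        "temperature": 0.7,
    },
    timeout=600,
)
print(resp.json()["results"][0]["text"])
```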

berkut1 avatar Mar 27 '24 21:03 berkut1

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar May 26 '24 23:05 github-actions[bot]