
High CPU usage in chat with Hermes 2 Pro Mistral 7B after generation has finished

Open brankoradovanovic-mcom opened this issue 11 months ago • 7 comments

Bug Report

With Hermes 2 Pro Mistral 7B, in certain situations chat.exe causes high CPU usage even after generation has finished.

Steps to Reproduce

  1. Install Hermes 2 Pro Mistral 7B (SHA256 checks out!)

  2. Set the context length to 8192. This is OK because the model supports 32k. The device is CPU, with 8 threads. (These settings may or may not be important here.)

  3. Open the chat, load the model, and enter a prompt such as the following:

Write a fictional story titled "A Triangle", with the following synopsis:

"An observer becomes entranced by a seemingly ordinary couple on the street, follows them home, and then
watches them from outside in the rising floodwaters, drawing an eerie connection between the woman and
a discarded, burned chair they’d noticed earlier."

Make sure the story is at least three pages long.

The actual prompt is irrelevant, as long as the response is long enough (around 700 words is sufficient - not actually that long).

  4. The model will respond with a story (or, at any rate, something that resembles one).

  5. Once generation has finished, CPU usage stays high as if generation were still running, instead of dropping to 0. It is due to chat.exe, according to Task Manager.

  6. Upon closing the chat window, chat.exe stays active in the background and high CPU usage continues until the process is killed in Task Manager.

I haven't seen this problem with other models, so it's kind of a narrow issue, but might be worth a look.

Your Environment

  • GPT4All version: 2.7.3
  • Operating System: Windows 10
  • Chat model used (if applicable): Hermes 2 Pro Mistral 7B

brankoradovanovic-mcom avatar Mar 26 '24 08:03 brankoradovanovic-mcom

I can confirm this.

devSJR avatar Apr 03 '24 07:04 devSJR

I can confirm similar (or the same) behavior on v2.7.3, with a context length of 30720, on Win10.

SINAPSA-IC avatar Apr 06 '24 21:04 SINAPSA-IC

This might be related to an issue with stop token, described in #2239.

brankoradovanovic-mcom avatar Apr 22 '24 12:04 brankoradovanovic-mcom

Had the same problem on a couple of occasions while running Coxcomb. There are certainly more models that exhibit the same behavior, but whether that has something to do with stop tokens or not is unclear.

brankoradovanovic-mcom avatar May 10 '24 08:05 brankoradovanovic-mcom

This is what Hermes 2 Pro Mistral generates if the code is modified to fix an infinite loop on unrecognized special tokens and to enable printing of special tokens (screenshot attached: "hermesprotrash").

It generates </s> and then gibberish from that point on. At least with this version it eventually stops instead of hanging, but clearly the stop token specified in the generation config (<|im_end|>) is not the only stop token it has been trained on: it will generate </s>, but has no idea what should follow </s>, as it has never seen anything after it in its training data.

The stop token can be a list (see Llama 3), so ideally everyone would set that correctly when uploading HF models, and the llama.cpp conversion scripts would pick it up (currently not implemented). But it seems that, at least right now, this is rarely done in practice, so the only robust solution is a user-customizable list of stop tokens. At that point we may as well implement customizable stop sequences - right now there is only a hardcoded list of "reverse prompts", all for Alpaca.
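
To make the idea concrete, here is a minimal sketch of how a user-customizable stop-sequence list could be matched against streamed output. It is not the GPT4All implementation; the names STOP_SEQUENCES, stream_tokens, and generate_with_stops are made up for illustration:

```python
# Minimal sketch of matching a user-customizable stop-sequence list against
# streamed output. STOP_SEQUENCES and stream_tokens are illustrative names,
# not part of GPT4All.

STOP_SEQUENCES = ["<|im_end|>", "</s>", "### Human:"]  # would come from user settings

def generate_with_stops(stream_tokens, stop_sequences=STOP_SEQUENCES):
    """Yield streamed text, cutting generation at the first stop sequence.

    A small tail is held back so a stop sequence split across token
    boundaries is still caught before it reaches the UI.
    """
    buffered = ""
    max_len = max(len(s) for s in stop_sequences)
    for token_text in stream_tokens:
        buffered += token_text
        for stop in stop_sequences:
            idx = buffered.find(stop)
            if idx != -1:
                yield buffered[:idx]  # emit only the text before the stop sequence
                return                # and end generation here
        # Everything except a tail that could still begin a stop sequence is safe to emit.
        safe = len(buffered) - (max_len - 1)
        if safe > 0:
            yield buffered[:safe]
            buffered = buffered[safe:]
    yield buffered  # the model stopped on its own
```

Driven by a per-model or user-editable list, a check like this would also cover models such as this one, which emit </s> even though only <|im_end|> is configured.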

cebtenzzre avatar May 16 '24 22:05 cebtenzzre

I am experiencing similar issues - Win11, v2.7.5, 13th Gen i7. The context length is set to 4096, and the program freezes after a lengthy response until terminated via Task Manager. Not sure if this is just an issue with my hardware, because RAM-wise it's only using 14 out of 32 GB.

jntm7 avatar May 16 '24 23:05 jntm7

> The stop token can be a list (see Llama 3), so ideally everyone would set that correctly when uploading HF models, and the llama.cpp conversion scripts would pick it up (currently not implemented). But it seems that, at least right now, this is rarely done in practice, so the only robust solution is a user-customizable list of stop tokens. At that point we may as well implement customizable stop sequences - right now there is only a hardcoded list of "reverse prompts", all for Alpaca.

Would this mean that, at least with the Python bindings, the issue could be handled by using stop_on_token_callback()?
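
For illustration only, a rough sketch of that idea might look like the following; the callback keyword, the (token_id, token_string) -> bool contract, and the model file name are assumptions about the Python bindings rather than a confirmed API, so check them against the bindings version you actually have:

```python
# Hedged sketch: stop generation on extra end tokens via a per-token callback.
# The `callback` keyword and its (token_id, token_string) -> bool contract are
# assumptions about the Python bindings; the model file name is a placeholder.
from gpt4all import GPT4All

EXTRA_STOPS = {"</s>", "<|im_end|>"}  # user-chosen stop strings

def stop_on_extra_tokens(token_id: int, token_string: str) -> bool:
    # Returning False asks the bindings to halt generation.
    return token_string not in EXTRA_STOPS

model = GPT4All("Hermes-2-Pro-Mistral-7B.Q4_0.gguf")  # placeholder file name
with model.chat_session():
    reply = model.generate(
        "Write a short story titled 'A Triangle'.",
        max_tokens=1024,
        callback=stop_on_extra_tokens,
    )
print(reply)
```

Even if this works, it is per-call plumbing in the bindings; the chat GUI would still need the user-customizable stop list discussed above.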

brankoradovanovic-mcom avatar May 17 '24 07:05 brankoradovanovic-mcom