serge
🐛 [Bug]: New install - response keeps repeating the last line
Bug description
I just pulled the image and spun up a container with default settings. I downloaded the Mistral-7B model and left everything else at its defaults. I've tried a few short questions, and the answer repeats its last line until I stop the container.
Steps to reproduce
- Spin up new container with default settings (from repo)
- Download Mistral-7B
- Start a new chat and ask "what is the square root of nine"
Environment Information
Docker version: 25.0.3
OS: Ubuntu 22.04.4 LTS on kernel 5.15.0-97
CPU: AMD Ryzen 5 2400G
Browser: Firefox version 123.0
Relevant log output
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 4165.37 MiB
...............................................................................................
llama_new_context_with_model: n_ctx = 2153
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 269.13 MiB
llama_new_context_with_model: KV self size = 269.12 MiB, K (f16): 134.56 MiB, V (f16): 134.56 MiB
llama_new_context_with_model: CPU input buffer size = 12.22 MiB
llama_new_context_with_model: CPU compute buffer size = 174.42 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-v0.1', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
Received termination signal!
++ _term
++ echo 'Received termination signal!'
++ kill -TERM 18
++ kill -TERM 19
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
Confirmations
- [X] I'm running the latest version of the main branch.
- [X] I checked existing issues to see if this has already been described.
Hello, I have the same bug when using Mistral or Mixtral for text generation. It keeps repeating the last sentence over and over until I restart the container. I tried increasing the repeat penalty, but it has no effect.
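For reference, here is a minimal sketch of how the same penalty can be exercised directly against llama-cpp-python, outside serge. The model path is a placeholder, and it's an assumption that serge forwards its setting to llama-cpp-python's `repeat_penalty` parameter:

```python
from llama_cpp import Llama

# Placeholder path -- point it at the GGUF file serge downloaded.
llm = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

out = llm(
    "What is the square root of nine?",
    max_tokens=128,
    repeat_penalty=1.3,  # llama-cpp-python's default is 1.1; higher penalizes repeats harder
)
print(out["choices"][0]["text"])
```

If raising the value here doesn't change the output either, that would suggest the penalty isn't the problem.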
I've noticed this for most, if not all, of the models I can test. This bug essentially makes serge useless.

Update: Reverting to `ghcr.io/serge-chat/serge:0.8.2` appears to vastly reduce the repeating issue, or eliminate it altogether. Still testing.
This is probably a bug in llama-cpp-python. I will update it this week and do a new release.
Which specific model are you all using? @SolutionsKrezus @fishscene
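In case anyone wants to help narrow this down, here is a rough sketch that streams tokens straight from llama-cpp-python with the same GGUF file and aborts when it detects a repetition loop. The model path and the loop heuristic are illustrative, not anything serge actually ships:

```python
from collections import deque
from llama_cpp import Llama

# Placeholder path -- use the same GGUF file that misbehaves under serge.
llm = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

window = deque(maxlen=16)  # the last 16 streamed text pieces
for chunk in llm("What is the square root of nine?", max_tokens=512, stream=True):
    piece = chunk["choices"][0]["text"]
    print(piece, end="", flush=True)
    window.append(piece)
    # Crude loop detector: the window is full but holds almost no distinct pieces.
    if len(window) == window.maxlen and len(set(window)) <= 2:
        print("\n[repetition loop detected, aborting]")
        break
```

If the loop reproduces here on the current llama-cpp-python but not on the version pinned in 0.8.2, that would point squarely at the dependency.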
I'm currently using Mistral 7B and Mixtral, @gaby. I reverted to 0.8.0 and it works like a charm.
> This is probably a bug in llama-cpp-python. I will update it this week and do a new release.
>
> Which specific model are you all using? @SolutionsKrezus @fishscene
Apologies, I'm at work at the moment. All models I tested were affected to some degree. Some more than others.
Off the top of my head: all current Mixtral models, at least 2 Mistral models, Neural Chat, one of the medical ones, and definitely a few more. I did not test anything above 13B, as those are beyond my hardware.
I would see random replies marked/flagged as code snippets… and if the model started repeating itself, that was the end of anything useful, as all subsequent replies would only repeat.
Of all the testing I did, getting 10 coherent replies was a major milestone, and even then it sometimes took multiple re-prompts (deleting my query and asking it slightly differently) to get to 10. A couple of models started spewing nonsense and repeats on the very first response.
All this to say, testing should be very easy to do. When I reverted to the previous serge release, I immediately saw an improvement.
Curious though: OP is using a Ryzen, and so am I: Ryzen 1700X, 32GB RAM, no CUDA GPU in use (an NVIDIA T400, I think). Using the CPU for inference.
Maybe this is isolated to Ryzen CPUs?
Another behavior to note: when asking some censored models a question, they have no reply at all, and no detectable CPU usage either. It was as if some pre-inference function said "nope" and didn't pass my query along to the AI model itself. There's a name for this pre-processing step, but it escapes me at the moment. Not sure if it is a clue either.
I don't think it is a Ryzen-related issue, @fishscene. I have the same problem with an Intel Xeon D-1540 with 32GB RAM and no GPU.
Same issue here. This pretty much renders the software completely useless :(