
Nonsense output

Open ysaric opened this issue 11 months ago • 22 comments

Hobbyist only here, hoping for some help.

I'm running Windows 11, miniforge3, Ollama 0.5.1-ipexllm-20250123, Ubuntu 22.04, and Open WebUI via Docker Desktop as the frontend. I have a Ryzen 9 5950X CPU, 64 GB of system RAM, and an Arc A770 with 16 GB VRAM.

I installed Ollama using the instructions at https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/ollama_quickstart.md and https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md
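For reference, those quickstart steps boil down to roughly the following on Windows (this is from memory; the linked docs are authoritative and the exact commands and environment variables may differ by version):

```
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]

:: create a folder for the ollama launcher and symlink the binaries into it
mkdir ollama-ipex
cd ollama-ipex
init-ollama.bat

:: offload all layers to the Arc GPU, then start the server
set OLLAMA_NUM_GPU=999
set ZES_ENABLE_SYSMAN=1
ollama serve
```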

The attached output is from the model deepseek-r1:7b, but I can get similar symptoms with other models.

The symptom is that I am getting gibberish output from even fairly simple prompts. My experience has been that this is most often after my first question, or after switching models, but it is in fact quite random.

Attaching my miniforge prompt output and screenshots of the Open WebUI output.

Ollama serve.txt


ysaric avatar Jan 28 '25 17:01 ysaric

I am having the same problem. I started with a fresh install of Ubuntu 24.10 and followed the instructions for the Docker install. I have tried a few different models of different sizes, and they all either start out producing garbage (random code, random math, the same word repeated over and over, random characters, etc.) or turn into it after a response or two. I have tried reinstalling Ubuntu to make sure nothing was corrupted, but got the same result.

Ubuntu 24.10, Intel Arc B580, Ryzen 5 5600, 48 GB RAM (2x16GB + 2x8GB)

donldmn avatar Jan 29 '25 18:01 donldmn

Try loading the model via “ollama run deepseek-***” (with your model name, of course) in a separate command-line environment alongside your running ollama server. I don't know why Open WebUI loads the model so badly, but this helps; it still starts hallucinating after 3-4 replies, although new conversations work fine. Damn, it still runs normally once and then pretends to be braindead.

Yesterday it worked fine. Maybe this is a WebUI issue, or Ollama?

Nope, ollama through the CLI also starts to hallucinate and calculate some physics stuff when I ask it to write a song. Maybe that's only because of the old Ollama and the new deepseek models (ipex-5.1 instead of 5.7?), but clearing the context and asking again via the CLI works fine for now; then it descends into delirium again.

AlexXT avatar Feb 01 '25 10:02 AlexXT

Hi, we are reproducing this issue and will get back to you soon.

sgwhat avatar Feb 06 '25 02:02 sgwhat

We are also having this issue while serving llama3.2:3b-instruct-q4_K_M with Ollama using the docker image intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0-SNAPSHOT@sha256:d2f6320fa5506789c5c753b979ef0070a6ae3d4bf0860171b7f226a0afe89c59.

OS: Windows 11 24H2 (WSL2). CPU/GPU: Intel Core Ultra 5 125H with integrated Arc GPU.

It seems the garbled output happens when the input exceeds the context length and gets truncated.

y1xia0w avatar Feb 07 '25 15:02 y1xia0w


My experience is that many models work absolutely fine, some exhibit the exact same behavior as described, and others (like llava:34b for me) always reply with gibberish. These may have different root causes; some might be template-related and broken outside of IPEX as well, but roughly 30% of the ollama models I've tried do not work reliably. I'll try to start collecting systematic data on the issues.

vladislavdonchev avatar Feb 08 '25 14:02 vladislavdonchev

I have the same issue. My hardware is an Intel Core Ultra 9 185H with integrated GPU. I have two copies of ollama: one uses the CPU, the other uses ipex-llm[cpp] with the iGPU. I did various tests with deepseek r1 1.5b, 8b, and qwen 1.8b:

(1) CPU version of ollama + open-webui
(2) iGPU version of ollama + open-webui
(3) iGPU version of ollama, using curl on the command line to connect to ollama (see the sketch below)

Result: same behavior for all models. (1) is always correct; (2) and (3) produce nonsense output most of the time.
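For reference, the kind of direct curl call used in test (3) might look like this (model name and prompt are placeholders, not the exact commands used above):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```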

I could try to debug a bit. Any suggestions?

Edit: I tried running deepseek r1 1.5b directly in Python using AutoModelForCausalLM with ipex-llm[xpu]. The response is correct, so I guess the issue is probably in ipex-llm[cpp].
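For context, a minimal sketch of that kind of direct ipex-llm[xpu] test (the model id, prompt, and generation settings are illustrative assumptions, not the exact script used above):

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # ipex-llm drop-in class

model_path = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed model id

# load with 4-bit weights and move to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half().to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("Why is the sky blue?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```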

suning-git avatar Feb 10 '25 02:02 suning-git

Hi @AlexXT @y1xia0w @suning-git @vladislavdonchev , for the deepseek-r1 issue, you may set num_ctx to a larger value as a workaround. Please follow the steps below.

  1. Create a file named Modelfile:
    FROM deepseek-r1:7b
    # you may set num_ctx to a larger value
    PARAMETER num_ctx 8192

  2. Re-create the ollama model:
    ollama create deepseek-r1-ctx-8k -f Modelfile
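  3. (Optional) Run the new model and, if your ollama build supports the show command, confirm the parameter took effect:
    ollama run deepseek-r1-ctx-8k
    ollama show deepseek-r1-ctx-8k --parameters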
    

sgwhat avatar Feb 10 '25 07:02 sgwhat

I have the same issue. My hardware is an Intel Core Ultra 9 185H with integrated GPU. I have two copies of ollama: one uses the CPU, the other uses ipex-llm[cpp] with the iGPU. I did various tests with deepseek r1 1.5b, 8b, and qwen 1.8b:

For deepseek-r1, you may refer to my last response. As for qwen 1.8b, I have reproduced your issue and working on fixing it.

sgwhat avatar Feb 10 '25 08:02 sgwhat

Hi @AlexXT @y1xia0w @suning-git @vladislavdonchev , for the deepseek-r1 issue, you may set num_ctx to a larger value as a workaround. Please follow the steps below.

Thanks for the workaround! Is this an issue in ipex-llm[cpp] or ollama? Could you say very briefly what caused it? I plan to use ipex-llm in my code.

suning-git avatar Feb 10 '25 08:02 suning-git

Thanks for the workaround! Is this an issue in ipex-llm[cpp] or ollama? Could you say very briefly what caused it? I plan to use ipex-llm in my code.

Ollama's default num_ctx is 2048, which may not be enough for deepseek-r1.
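If you are calling the Ollama API directly rather than going through a Modelfile, num_ctx can also be passed per request via options; a minimal sketch against the standard /api/chat endpoint (model and message are placeholders):

```
curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [{"role": "user", "content": "Hello"}],
  "options": { "num_ctx": 8192 },
  "stream": false
}'
```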

sgwhat avatar Feb 10 '25 08:02 sgwhat


Hello, I am the OP above. Do you think this might help with my issue? If so, does it matter where I create the Modelfile, or run the command from? I assume the Modelfile is a text file where I've removed the file extension.

ysaric avatar Feb 10 '25 15:02 ysaric

I've been having the same issue; I tried increasing num_ctx to 8192 and it didn't help. The first generation seems to work just fine, but the second generation goes off the rails.

I've attached a text file that shows what I did to get everything set up, as well as the results (it's a lot of text and I didn't want to paste it inline). Let me know if there's anything else I can do to help.

Intel Arc A770 16GB on Ubuntu 22.04.5

deepseek debugging.txt

wjr1985 avatar Feb 10 '25 16:02 wjr1985

OK, so, seems like I can reproduce this (or similar) issue with a few models and the attached prompt chain.

ollama.log

So far I've tried with Qwen2.5-Coder:7B, 14B & 32B and InternLM3:8B

It is always the same: 3 prompts are fine and the 4th just keeps producing gibberish forever.

UPDATE: OK, I'm making the requests through Open WebUI, but the behavior seems to be different when using the UI and the API. Will investigate more.

vladislavdonchev avatar Feb 10 '25 22:02 vladislavdonchev

I've been having the same issue; I tried increasing num_ctx to 8192 and it didn't help. The first generation seems to work just fine, but the second generation goes off the rails.

Intel Arc A770 16GB on Ubuntu 22.04.5

Which version of the Docker image are you using? You may refer to the docker guides. I tested deepseek-r1:7b, and it works fine.

sgwhat avatar Feb 11 '25 02:02 sgwhat

Hello, I am the OP above. Do you think this might help with my issue? If so, does it matter where I create the Modelfile, or run the command from? I assume the Modelfile is a text file where I've removed the file extension.

I think this can help you resolve the issue with the Ollama CLI, but your issue with Open-WebUI is still being worked on. The location of the Modelfile is not important, but it is recommended to place it in the same directory as your ipex-llm ollama installation.

sgwhat avatar Feb 11 '25 02:02 sgwhat

I'm on an Intel Alder Lake iGPU and often get gibberish/unrelated output.

ghost avatar Feb 11 '25 03:02 ghost

UPDATE: So, yeah, it's definitely something with Open WebUI / calling the Open WebUI API. I've managed to create some code that reproduces the issue every time on my setup:

import uuid
from typing import Dict, List

import requests

# NOTE: API_TOKEN, OPEN_WEBUI_BASE_URL, CHATS and authenticate() are defined
# elsewhere in the original script and omitted here.


def prompt_model(
        message: str,
        chat_id: str | None,
        model: str,
) -> tuple[str, str | None]:
    global CHATS

    # 1) Authenticate against Open WebUI
    authenticate()

    chat_data = {}                    # unused in this minimal repro
    assistant_message_stub_id = None  # unused in this minimal repro

    #
    # 2) Call /api/chat/completions
    #
    completion_id = str(uuid.uuid4())
    completion = call_completions(
        chat_id=chat_id,
        completion_id=completion_id,
        messages=[{"role": "user", "content": message}],
        model=model
    )
    assistant_message = completion["choices"][0]["message"]["content"]

    #
    # 3) Return final assistant message
    #
    return assistant_message, chat_id


def call_completions(chat_id: str, completion_id: str, messages: List, model: str) -> Dict:
    """
    Calls /api/chat/completions and returns the parsed JSON response.
    """
    headers = {
        'Authorization': f'Bearer {API_TOKEN}',
        'Content-Type': 'application/json',
        'Referer': f'http://127.0.0.1:8080/c/{chat_id}'
    }

    # For minimal compliance, the server typically just needs:
    # - model
    # - messages (the user message, maybe partial history)
    # - chat_id
    # - session_id, id (some unique IDs)
    # - background_tasks, features, etc.
    completions_body = {
        "stream": False,
        "model": model,
        "params": {},
        "messages": messages,
        "features": {"web_search": False, "code_interpreter": False, "image_generation": False, },
        "chat_id": chat_id,
        "id": completion_id,
        "background_tasks": {
            "title_generation": False,
            "tags_generation": True
        }
    }

    resp = requests.post(f"{OPEN_WEBUI_BASE_URL}/api/chat/completions", headers=headers, json=completions_body)
    resp.raise_for_status()

    return resp.json()

I have tested with different num_ctx and num_predict and there is no difference in the observed behavior. No other values / parameters have been changed for the models.

Calling the ollama API directly using the ollama python package works as expected and without issue. Repeatedly calling the models through Open WebUI seems to work OK... most of the time. My script, however, always fails on the 3rd/4th prompt.
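For comparison, a direct call with the ollama Python package looks roughly like this (model name, prompt, and num_ctx are illustrative):

```python
import ollama

response = ollama.chat(
    model="qwen2.5-coder:7b",  # any of the affected models
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    options={"num_ctx": 8192},  # same knob as PARAMETER num_ctx in a Modelfile
)
print(response["message"]["content"])
```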

vladislavdonchev avatar Feb 11 '25 10:02 vladislavdonchev

In my case, I reproduce this by using the ollama-intel-arc docker image and Alpaca, a GNOME app that connects to the running ollama session. It seems that increasing the context size (num_ctx) and setting -e OLLAMA_MAX_LOADED_MODELS=1 -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_NUM_GPU=999 -e DEVICE=iGPU in the environment variables helps. At least it has for me, so far.
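For reference, that kind of invocation might look roughly like the following; the image reference, volume, and device mapping are assumptions based on the comment above, not a verified configuration:

```
docker run -d --name ollama-intel-arc \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_NUM_GPU=999 \
  -e DEVICE=iGPU \
  <ollama-intel-arc image>
```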

ghost avatar Feb 11 '25 17:02 ghost

Guys... I think I finally found the culprit. It seems to be related to Open WebUI / ollama batching when multiple requests are made, and it's more easily reproducible with small context sizes (such as with default models; I guess the generation runs out of tokens sooner then).

You can reliably reproduce the behavior like I've done in this video: https://www.youtube.com/watch?v=KR2nKc-hT1M

vladislavdonchev avatar Feb 11 '25 20:02 vladislavdonchev

That explains why setting OLLAMA_MAX_LOADED_MODELS=1 and a higher num_ctx on the models I use fixed it for me.

I wonder if this can be fixed in the drivers


ghost avatar Feb 11 '25 20:02 ghost


Well, first step would be to get batching working with ollama manually and see if that happens, but I'm working on something else right now, so let's see if someone from the crew picks it up.
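If anyone wants to poke at the batching angle, one crude way is to fire several concurrent requests straight at ollama and check whether later responses degrade; a sketch with the ollama Python package (model, prompts, and worker count are purely illustrative, not a confirmed repro):

```python
import concurrent.futures

import ollama

PROMPTS = [
    "Summarize the plot of Hamlet in two sentences.",
    "Write a limerick about GPUs.",
    "Explain what num_ctx controls in ollama.",
    "List three uses for a Raspberry Pi.",
]

def ask(prompt: str) -> str:
    resp = ollama.chat(
        model="deepseek-r1:7b",
        messages=[{"role": "user", "content": prompt}],
        options={"num_ctx": 2048},  # small context to make the failure easier to hit
    )
    return resp["message"]["content"]

# send the prompts concurrently so the server has to batch/interleave them
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(PROMPTS, pool.map(ask, PROMPTS)):
        print(f"--- {prompt}\n{answer[:200]}\n")
```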

vladislavdonchev avatar Feb 11 '25 20:02 vladislavdonchev

Was this ever fixed? I'm getting nonsense after a couple of prompts using phi4-mini on an i5-1235U. The first couple of prompts work fine, and then nonsense.

bobloadmire avatar Nov 16 '25 19:11 bobloadmire