
Garbage output when serving 4 parallel users.

Open adi-lb-phoenix opened this issue 1 year ago • 6 comments

I started a server with the command `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ./ollama serve`. We opened 4 terminals and executed `./ollama run codellama`, after which the model loaded. Then, on all 4 terminals, we gave the prompt `>>> write a long poem` and executed it simultaneously (four parallel requests). The output is garbage values. (Screenshot attached: Screenshot_20240911_152331.)
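The four-terminal repro above can be sketched programmatically. This is a minimal sketch, assuming the server from the command above; `generate()` is a hypothetical stand-in for what `ollama run` does per terminal (a POST to the server's `/api/generate` endpoint with `{"model": "codellama", "prompt": ...}`) so the concurrency pattern itself is runnable here without a live server.

```python
# Sketch of the repro: four concurrent requests with the same prompt,
# mirroring four terminals running `./ollama run codellama` at once.
from concurrent.futures import ThreadPoolExecutor

PROMPT = "write a long poem"

def generate(prompt: str, slot: int) -> str:
    # Hypothetical stand-in for the HTTP call each terminal makes:
    #   POST http://localhost:11434/api/generate
    #   {"model": "codellama", "prompt": prompt}
    # Replace this body with a real request against a live server.
    return f"slot {slot}: response to {prompt!r}"

# With OLLAMA_NUM_PARALLEL=4 the server processes these in one batch,
# which is the configuration that produced the garbage output.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: generate(PROMPT, s), range(4)))

for r in results:
    print(r)
```

Firing the requests from one thread pool (rather than by hand in four terminals) makes the simultaneous arrival reproducible.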

adi-lb-phoenix avatar Sep 11 '24 10:09 adi-lb-phoenix

Hi @adi-lb-phoenix, could you please provide your env and device config? In our test, ollama was able to run codellama as expected on MTL Linux.

sgwhat avatar Sep 13 '24 02:09 sgwhat

Hello @sgwhat. I have installed podman and distrobox on KDE Neon, and on it I created an Ubuntu distro using distrobox. IPEX-LLM is deployed inside the Ubuntu distrobox. Inside the Ubuntu distrobox:

uname -a
Linux ubuntu22_ollama.JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

On the host system

Linux JOHNAIC 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The GPU is an Intel Arc A770.

adi-lb-phoenix avatar Sep 13 '24 07:09 adi-lb-phoenix

We are currently locating the cause of the codellama output issue on Linux with the Arc A770 and will notify you as soon as possible.

sgwhat avatar Sep 14 '24 02:09 sgwhat

@sgwhat Thank you for picking this up. It has been observed not just for codellama but for other models as well.

adi-lb-phoenix avatar Sep 14 '24 06:09 adi-lb-phoenix

https://github.com/ggerganov/llama.cpp/issues/9505#issuecomment-2352561991 As shown here, llama.cpp does not output garbage values.

adi-lb-phoenix avatar Sep 16 '24 10:09 adi-lb-phoenix

When serving just one user, IPEX-LLM is faster than llama.cpp. Result from ipex-llm:

llama_print_timings:        load time =    7797.13 ms
llama_print_timings:      sample time =      30.64 ms /   400 runs   (    0.08 ms per token, 13055.26 tokens per second)
llama_print_timings: prompt eval time =    1322.78 ms /    13 tokens (  101.75 ms per token,     9.83 tokens per second)
llama_print_timings:        eval time =   11301.98 ms /   399 runs   (   28.33 ms per token,    35.30 tokens per second)
llama_print_timings:       total time =   12711.93 ms /   412 tokens

Below is the result from llama.cpp

llama_perf_sampler_print:    sampling time =      31.73 ms /   413 runs   (    0.08 ms per token, 13015.66 tokens per second)
llama_perf_context_print:        load time =    4317.89 ms
llama_perf_context_print: prompt eval time =     456.68 ms /    13 tokens (   35.13 ms per token,    28.47 tokens per second)
llama_perf_context_print:        eval time =   22846.95 ms /   399 runs   (   57.26 ms per token,    17.46 tokens per second)
llama_perf_context_print:       total time =   23379.98 ms /   412 tokens
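As a quick sanity check on the two logs above, the decode throughput can be recomputed from eval time and token count (399 runs each), confirming ipex-llm's roughly 2x single-user advantage at the eval stage:

```python
# Recompute tokens/second from the reported eval times (ms) and run counts.
ipex_tps = 399 / (11301.98 / 1000)      # ipex-llm:  reported 35.30 tok/s
llamacpp_tps = 399 / (22846.95 / 1000)  # llama.cpp: reported 17.46 tok/s

print(f"ipex-llm:  {ipex_tps:.2f} tok/s")
print(f"llama.cpp: {llamacpp_tps:.2f} tok/s")
print(f"speedup:   {ipex_tps / llamacpp_tps:.2f}x")
```

Note the prompt-eval numbers go the other way (llama.cpp is faster there), but on a 400-token generation the eval phase dominates total time.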

adi-lb-phoenix avatar Sep 16 '24 11:09 adi-lb-phoenix