intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0 Docker image causes memory issue with Intel Arc A380
Hey. Not a computer scientist here, but I thought you'd like to know that the latest pushed container image is causing issues with GPU inference for me.
System specs:
CPU: AMD Ryzen 3600
GPU: Intel Arc A380
RAM: 16 GB DDR4 ECC unregistered, 3200 MHz, single channel
OS: Debian 12
Kernel: 6.7.12+bpo-amd64
Docker: version 27.2.0, build 3ab4256
Logs attached: Logs_Latest.txt, Logs_2.1.0.txt
Native API failed. Native API returns: -6 (PI_ERROR_OUT_OF_HOST_MEMORY) -6 (PI_ERROR_OUT_OF_HOST_MEMORY)
This looks like an OOM error. You can try a smaller model such as dolphin-phi:latest.
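For example, a quick way to test is something like this (a rough sketch; dolphin-phi:latest is just the small model suggested above, and any similarly small model should behave the same way):

# Pull and run a small model to check whether the OOM error depends on model size
ollama pull dolphin-phi:latest
ollama run dolphin-phi:latest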
Hi, yes, a smaller model (~0.3 GB) does work for me on the latest container. I think there is still an issue, though, as version 2.1.0 lets me use models that match the system's VRAM (~6 GB). Even with all other Docker containers shut down and ~14 GB of free system memory, the error persists in the new container. It's possible this is an error in SYCL device detection, as the latest container does not pick up the CPU either. While I see high CPU core usage in htop during inference on 2.1.0, I can also see that hardware acceleration is being used by monitoring GPU usage with intel_gpu_top. I'm not sure how much this means to you. It was working in the previous container, but I can't get it to work in 2.2.0+, so I'm sticking with 2.1.0 for the time being.
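In case it helps, this is roughly how I'm checking device detection and GPU usage (a sketch; the container name my-ipex-llm is just a placeholder, and sycl-ls assumes the oneAPI tooling bundled in the image):

# List the SYCL devices the runtime can see inside the container
docker exec -it my-ipex-llm sycl-ls

# Watch GPU engine utilization on the host while a prompt is running
sudo apt install intel-gpu-tools
sudo intel_gpu_top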
I'm not sure I fully understand the problem. Do you mean the issue exists in the latest 2.2.0 version while 2.1.0 works normally? The Docker image was barely changed between 2.1.0 and 2.2.0. I have tested 2.2.0-snapshot on an Arc A770 and did not hit any OOM problem. Maybe it's caused by the VRAM difference between the A380 (6 GB) and the A770 (16 GB)?
Hi, yes, while I can run LLMs of around 5 GB in 2.1.0, I can't run them in 2.2.0 with the exact same Docker setup. I can run much smaller LLMs in 2.2.0, so the Ollama functionality isn't totally bust, but there does seem to be a memory issue.
I'm not sure where the issue lies though. Please let me know if there is any other system information that you'd like me to collect to help get to the bottom of this.
Thanks for your question. There was indeed a llama.cpp/Ollama upgrade between images 2.1.0 and 2.2.0, which may be the root cause. We will look into the issue and confirm. You can keep running 2.1.0 in the meantime.
Hi @bobsdacool, your log shows n_ctx = 8192. This is because the latest upstream Ollama defaults to OLLAMA_NUM_PARALLEL=4, which sets the total space allocated for context, n_ctx, to 4*2048, where 2048 is the model's default context size. Try running export OLLAMA_NUM_PARALLEL=1 before you start ollama serve (a quick sketch of this follows the Modelfile example below). If the problem persists, you can manually create a Modelfile and set the model's num_ctx smaller, e.g.:
FROM llama2
PARAMETER num_ctx 512
then create and run the model with:
ollama create llama2:latest-nctx512 -f Modelfile
ollama run llama2:latest-nctx512
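For the environment-variable route, the idea is roughly this (a sketch, assuming you start ollama serve by hand inside the container; if you launch it another way, the variable just needs to be set in that environment, e.g. via docker run -e OLLAMA_NUM_PARALLEL=1):

# Limit Ollama to one parallel request slot so n_ctx stays at the
# model's default 2048 instead of 4*2048
export OLLAMA_NUM_PARALLEL=1
ollama serve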