openvino_notebooks

(Optimization of LLM inference) Does Intel OpenVINO support offloading LLM layers, so that some layers stay on the SSD while the main layers are loaded into RAM during inference?

Open hsulin0806 opened this issue 1 year ago • 3 comments

Functional discussion for this project. notebooks/llm-chatbot

Intel's official documentation (https://www.intel.com.tw/content/www/tw/zh/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html) confirms support for Ollama.

Ollama's FAQ on GitHub (https://github.com/ollama/ollama/blob/main/docs/faq.md) describes where a loaded model can reside:

100% GPU: The model is fully loaded into the GPU.
100% CPU: The model is fully loaded into system memory.
48%/52% CPU/GPU: The model is split between the GPU and system memory.

Ollama is powered by llama.cpp, which supports the --gpu-layers parameter to distribute model layers between VRAM and RAM, reducing GPU memory pressure.

However, when the CPU handles inference, the model is loaded entirely into RAM. Could OpenVINO introduce a parameter or feature that offloads some model layers to SSD, using it as temporary storage? This would reduce RAM usage and make inference more practical on resource-limited systems.
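To make the request concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not an OpenVINO feature or API. The file layout, `LAYER_SIZE`, `NUM_LAYERS`, and the element-wise "layer" are all hypothetical, chosen only to show weights staying in a file on disk and being read one layer at a time through a memory map.

```python
import mmap
import os
import struct
import tempfile

LAYER_SIZE = 4   # floats per layer (hypothetical, tiny for illustration)
NUM_LAYERS = 3   # hypothetical layer count
LAYER_FORMAT = f"<{LAYER_SIZE}f"
LAYER_BYTES = struct.calcsize(LAYER_FORMAT)


def write_weights(path):
    """Write dummy weights to disk; layer i is filled with float(i + 1)."""
    with open(path, "wb") as f:
        for i in range(NUM_LAYERS):
            f.write(struct.pack(LAYER_FORMAT, *([float(i + 1)] * LAYER_SIZE)))


def run_layers(path, x):
    """Apply each layer as an element-wise multiply, reading one layer at a time."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(NUM_LAYERS):
            # Only this layer's bytes are unpacked into Python objects;
            # the OS pages the mapped file in from disk on demand.
            weights = struct.unpack_from(LAYER_FORMAT, mm, i * LAYER_BYTES)
            x = [a * w for a, w in zip(x, weights)]
    return x


def demo():
    """Create a temporary weight file, run all layers over it, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_weights(path)
        return run_layers(path, [1.0] * LAYER_SIZE)
    finally:
        os.remove(path)
```

In this toy version, memory use is bounded by one layer plus the activations, at the cost of disk reads on every pass. A real runtime would also need to handle prefetching and throughput, which the sketch does not attempt.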

hsulin0806 avatar Nov 19 '24 01:11 hsulin0806

Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried experimenting with HETERO (see https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html)?
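For reference, a minimal sketch of selecting HETERO from the OpenVINO Python API might look like the following. The helper names `hetero_device` and `compile_hetero` and the `model_path` argument are hypothetical. The sketch assumes the `openvino` package is installed and that the listed devices are available on the machine.

```python
def hetero_device(*devices):
    """Build a HETERO device string; devices listed first get higher priority."""
    if not devices:
        raise ValueError("at least one device is required")
    return "HETERO:" + ",".join(devices)


def compile_hetero(model_path, *devices):
    """Compile a model for HETERO execution (requires the openvino package)."""
    # Imported lazily so hetero_device() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    model = core.read_model(model_path)  # model_path is a placeholder, e.g. an IR .xml file
    return core.compile_model(model, hetero_device(*devices))
```

For example, `compile_hetero("model.xml", "GPU", "CPU")` would request the device string `HETERO:GPU,CPU`. Per the linked documentation, HETERO distributes parts of the model across the listed compute devices. This sketch shows nothing about caching to disk.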

brmarkus avatar Nov 19 '24 07:11 brmarkus

Hi, thank you for your response.

Does HETERO mode allow model data that would otherwise be held in RAM to be cached on an SSD, reducing RAM usage? If this functionality is not available, are there any development plans to support caching on an SSD?

hsulin0806 avatar Nov 25 '24 01:11 hsulin0806

This issue will be closed in a week because of 9 months of no activity.

github-actions[bot] avatar Aug 27 '25 00:08 github-actions[bot]