(Optimization of LLM inference) Does Intel OpenVINO support offloading LLM layers, so that some layers stay on the SSD while the main layers are loaded into RAM during inference?
Functional discussion for this project. notebooks/llm-chatbot
Intel's official documentation: https://www.intel.com.tw/content/www/tw/zh/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html confirms support for Ollama.
Ollama's FAQ on GitHub (https://github.com/ollama/ollama/blob/main/docs/faq.md) describes where a loaded model can reside:
100% GPU: The model is fully loaded into the GPU.
100% CPU: The model is fully loaded into system memory.
48%/52% CPU/GPU: The model is split between the GPU and system memory.

Ollama is powered by llama.cpp, which supports the --gpu-layers parameter to distribute model layers between VRAM and RAM, reducing GPU memory pressure.
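To make the VRAM/RAM split concrete, here is a rough sketch of how a layer count could be derived from a VRAM budget. The helper name `split_layers`, the fixed per-layer size, and the numbers are assumptions for illustration; this is not how Ollama or llama.cpp compute the split internally.

```python
def split_layers(n_layers: int, layer_bytes: int, vram_budget_bytes: int) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers), assuming every layer has the same size.

    Hypothetical helper: real runtimes also account for KV cache, scratch
    buffers, and layers of different sizes.
    """
    if n_layers < 0 or layer_bytes <= 0 or vram_budget_bytes < 0:
        raise ValueError("sizes must be non-negative and layer_bytes positive")
    # Put as many whole layers on the GPU as the budget allows.
    gpu_layers = min(n_layers, vram_budget_bytes // layer_bytes)
    # Whatever does not fit stays in system RAM.
    return gpu_layers, n_layers - gpu_layers
```

The first value is the kind of number one would pass as --gpu-layers; the remainder is what stays in RAM.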
However, when the CPU handles inference, the model is entirely loaded into RAM. Would it be possible for OpenVINO to introduce a parameter or functionality to support offloading model layers to SSD storage as temporary storage? This would reduce RAM usage, offering a more efficient way to handle resource-limited scenarios.
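To illustrate what is being asked for, here is a minimal sketch of SSD-backed layers using a memory-mapped file, written with the Python standard library only. It is not an OpenVINO API: the class `DiskBackedLayers`, the file layout, and the tiny layer size are all hypothetical. The idea is that the weights stay in a file and the operating system pages a layer into RAM only when it is actually read.

```python
import mmap
import os
import struct
import tempfile

FLOATS_PER_LAYER = 4                 # tiny on purpose; real layers are far larger
LAYER_BYTES = FLOATS_PER_LAYER * 4   # float32 values


def write_layers(path, layers):
    """Write each layer as consecutive little-endian float32 values."""
    with open(path, "wb") as f:
        for layer in layers:
            f.write(struct.pack(f"<{FLOATS_PER_LAYER}f", *layer))


class DiskBackedLayers:
    """Hypothetical store: memory-maps a weight file instead of reading it all."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def layer(self, index):
        # Only the pages backing this slice need to be resident in RAM.
        offset = index * LAYER_BYTES
        return struct.unpack_from(f"<{FLOATS_PER_LAYER}f", self._map, offset)

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Write three small layers to a temp file and read them back one by one."""
    layers = [[float(i)] * FLOATS_PER_LAYER for i in range(3)]
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_layers(path, layers)
        store = DiskBackedLayers(path)
        try:
            return [sum(store.layer(i)) for i in range(3)]
        finally:
            store.close()
    finally:
        os.remove(path)
```

The trade-off is latency: every layer that is not resident has to be read from the SSD during the forward pass, so RAM usage drops at the cost of slower token generation.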
Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried to experiment with HETERO (see "https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html")?
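For reference, a minimal sketch of selecting HETERO from Python, assuming the `openvino` package is installed and the listed devices exist on the machine. The helper names and `model_path` are hypothetical. HETERO assigns parts of one model to different devices in priority order; this sketch does not claim anything about spilling weights to SSD.

```python
def hetero_device(priorities):
    """Build a HETERO device string such as "HETERO:GPU,CPU"."""
    if not priorities:
        raise ValueError("at least one device is required")
    return "HETERO:" + ",".join(priorities)


def compile_hetero(model_path, priorities=("GPU", "CPU")):
    """Compile a model with HETERO; needs OpenVINO and the named devices."""
    # Imported lazily so hetero_device() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    model = core.read_model(model_path)  # model_path: hypothetical IR or ONNX file
    return core.compile_model(model, hetero_device(priorities))
```

Devices listed first get priority, so "HETERO:GPU,CPU" tries the GPU first and falls back to the CPU for operations the GPU cannot run.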
Hi, thank you for your response.
Does HETERO mode allow model data that would otherwise sit in RAM to be cached on an SSD, reducing RAM usage? If this functionality is not available, are there any plans to support SSD-backed caching?
This issue will be closed in a week because of 9 months of no activity.