Feature Request: Use direct_io for model load and inference
Prerequisites
- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
According to this article, using unbuffered (direct) reads could speed up model load and inference of big models on memory-constrained servers: https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826
Motivation
Model load times can be significant for big models like R1.
Possible Implementation
Use the O_DIRECT flag when opening the model file for loading.
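A minimal sketch of what a direct-I/O read path could look like on Linux. This is an illustration, not llama.cpp code: the function name `read_direct` and the fixed 4096-byte alignment are assumptions (the real block size should be queried from the device, and the call would have to be wired into llama.cpp's existing file loaders). O_DIRECT requires the buffer address, file offset, and transfer size to all be block-aligned, which is most of the work here:

```c
// Sketch only: hedged example of a direct-I/O read, not llama.cpp's loader.
// ALIGNMENT of 4096 is an assumption; the actual logical block size of the
// device should be queried (e.g. ioctl BLKSSZGET) rather than hardcoded.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT 4096

/* Read `size` bytes at `offset` with O_DIRECT. The request is rounded out
 * to ALIGNMENT boundaries (as O_DIRECT requires) and the wanted bytes are
 * copied back into the caller's buffer. Returns bytes read, or -1. */
static ssize_t read_direct(const char * path, void * dst, size_t size, off_t offset) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return -1; }

    off_t  aligned_off = offset & ~((off_t)ALIGNMENT - 1);
    size_t pad         = (size_t)(offset - aligned_off);
    size_t aligned_sz  = (pad + size + ALIGNMENT - 1) & ~((size_t)ALIGNMENT - 1);

    void * buf = NULL;
    if (posix_memalign(&buf, ALIGNMENT, aligned_sz) != 0) { close(fd); return -1; }

    ssize_t n = pread(fd, buf, aligned_sz, aligned_off);
    if (n > (ssize_t)pad) {
        size_t usable = (size_t)n - pad;
        if (usable > size) usable = size;
        memcpy(dst, (char *)buf + pad, usable);
        n = (ssize_t)usable;
    } else {
        n = -1;
    }

    free(buf);
    close(fd);
    return n;
}
```

The trade-off to keep in mind: O_DIRECT bypasses the page cache, so it avoids evicting memory the inference process needs, but repeated loads of the same file no longer benefit from caching. That matches the use case in the linked article, where the model is far larger than RAM anyway.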