mmap()-backed istream implementation
This is for issue #91.
Treat this as a first draft. There are definitely some things that need to be changed, and they will be changed shortly.
I have not benchmarked. I'm submitting the PR early to get feedback.
Questions/Observations for discussion:
- Should `llama_istream` be a part of `utils.cpp`?
- `MADV_SEQUENTIAL` is probably wrong, as brought up in #91. I'll benchmark to make sure, then fix that before it's merged.
- When I wrote `llama_istream`, the intent was that you wouldn't have to check `USE_MMAP` at the site where it's constructed. https://github.com/ggerganov/llama.cpp/pull/43 throws a wrench in that. I could create a quick wrapper class, rather than using `using`. It might be a little confusing when people merge, though.
- What I'm doing is object-oriented, which doesn't fit the coding style very well. However, `std::istream`'s copy constructor is explicitly deleted, so I can't get it out of a function except as an rvalue return value (or a global variable, which I think would be worse), and the convention elsewhere is that the return value is for the error.
- Subclassing `istream` was WAY more painful than I expected. For some reason, if I don't reimplement `seekoff()` and `seekpos()` this way, `tellg()` seems to return garbage (see the sketch after this list). Couldn't tell you why, nor can I find any hint of why on the internet. The signature is also very finicky. If you have any insight into why it needs to be exactly this way, let me know.
- I'm not completely sure which `mmap()` arguments to use. If you have a better suggestion, let me know.
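To make the `seekoff()`/`seekpos()` point concrete, here is a minimal sketch of an `mmap()`-backed stream buffer (the class and member names are invented for illustration; this is not the PR's actual code). The `tellg()` garbage follows from the defaults: `std::basic_streambuf::seekoff()` returns `pos_type(off_type(-1))` unless overridden, and `tellg()` is specified in terms of `seekoff(0, cur)`:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <istream>
#include <stdexcept>

// Minimal mmap()-backed streambuf sketch; error handling abbreviated.
class mmap_streambuf : public std::streambuf {
public:
    explicit mmap_streambuf(const char * path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed");
        struct stat st;
        fstat(fd, &st);
        size_ = st.st_size;
        data_ = static_cast<char *>(mmap(nullptr, size_, PROT_READ, MAP_SHARED, fd, 0));
        close(fd); // the mapping outlives the descriptor
        if (data_ == MAP_FAILED) throw std::runtime_error("mmap failed");
        setg(data_, data_, data_ + size_); // the whole file is the get area
    }
    ~mmap_streambuf() override { munmap(data_, size_); }

protected:
    // Without these overrides, tellg() returns garbage: the default
    // seekoff() fails with pos_type(off_type(-1)), and tellg() is
    // defined as rdbuf()->pubseekoff(0, std::ios_base::cur, in).
    // Input-only sketch: the openmode argument is ignored.
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode) override {
        char * base = dir == std::ios_base::beg ? eback()
                    : dir == std::ios_base::cur ? gptr()
                                                : egptr();
        char * target = base + off;
        if (target < eback() || target > egptr()) {
            return pos_type(off_type(-1));
        }
        setg(eback(), target, egptr());
        return pos_type(target - eback());
    }
    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
        return seekoff(off_type(pos), std::ios_base::beg, which);
    }

private:
    char * data_ = nullptr;
    size_t size_ = 0;
};
```

Wrapping it at the use site then looks like `mmap_streambuf buf("model.bin"); std::istream in(&buf);`, which also sidesteps `std::istream`'s deleted copy constructor: only the buffer is custom, and the stream itself can be constructed wherever it's used.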
My comment about `MADV_SEQUENTIAL` assumed you were trying to implement a zero-copy approach and have the inference-time code use the model directly from the mapping, rather than still copying everything into private memory first. (To be fair, this is easier said than done if the file format is not designed to be used this way; you would likely run into alignment issues.)
It's probably not as important here, and may actually help, since you read most of the data in the mmap region (at least the weights) only once. (It does seem to scan through the file a few times, skipping weights, to do the name-based lookups, but that's not a huge deal.)
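For reference, here is a short hedged sketch of that trade-off, assuming POSIX `madvise()` on an existing mapping:

```cpp
#include <sys/mman.h>

// Sketch: advise the kernel about the access pattern of an existing
// mapping at `addr` of `len` bytes.
void advise_mapping(void * addr, size_t len, bool single_pass) {
    if (single_pass) {
        // Expect linear access: the kernel may read ahead aggressively and
        // may free pages soon after they are read. Good for one sequential
        // pass, but works against keeping the model resident across runs.
        madvise(addr, len, MADV_SEQUENTIAL);
    } else {
        // Ask for prefetch without biasing eviction.
        madvise(addr, len, MADV_WILLNEED);
    }
}
```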
I'm also not sure much is gained by creating an mmap-backed `istream`: if you are going to copy the data anyway, `mmap` doesn't have many advantages over normal `read` I/O and adds significant complexity. For large reads, `mmap` can mean many more context switches than regular I/O, just less visible ones, since they're triggered by the MMU (as page faults) rather than explicitly by user code.
Generally, the advantages of mmap()ing files are that you can avoid copying:
- from the file into private process memory (you can read the exact memory the kernel originally read the file into);
- among processes (multiple processes mapping the same file hit the same physical memory, except in the case of `MAP_PRIVATE`);
- across launches (your process exiting doesn't immediately evict everything from the kernel page cache, so the next launch gets the same underlying memory if it hasn't been evicted). This last one especially is lost if you have to copy anyway, since a second in-memory copy of the model in private process memory forces the kernel to evict more from the page cache.
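The across-launches point is observable. A sketch, assuming the Linux signature of `mincore()`: count how many pages of a fresh mapping are already resident in the page cache; on a warm second launch, most of the file should already be resident before anything is read.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <vector>

// Sketch: count how many pages of an existing mapping are already resident
// in the page cache. Call right after mmap(), before touching the data.
size_t resident_pages(void * addr, size_t len) {
    const size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
    const size_t npages = (len + pagesz - 1) / pagesz;
    std::vector<unsigned char> vec(npages);
    if (mincore(addr, len, vec.data()) != 0) {
        return 0; // treat errors as "unknown"
    }
    size_t n = 0;
    for (unsigned char v : vec) {
        n += v & 1; // low bit set => page resident
    }
    return n;
}
```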
Thank you for sending us this change! This change would result in an 18% speedup of loading in the average case; however, what @apage43 says is true. It doesn't make sense to use mmap() as a read() replacement, because the true power of mmap() is that it gives us the ability to not read the file at all. I've just written a blog post for a command line utility I wrote, https://justine.lol/rusage/, which uses what we've done here as an example use case and, at the bottom, adds further clarity to these points.
We need to go back to the drawing board on this. An mmap() change should reduce startup latency not by 18%, but by 100%. In order to make that change, we need to find a way for the tensor data structures that exist on disk to be the same as what llama.cpp needs in memory at runtime. I propose that we discuss our ideas for how we might do this in #91.
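To make that direction concrete, here is a hypothetical sketch (the names are invented for illustration; this is not an existing llama.cpp API): if the on-disk tensor layout and alignment already match what the compute kernels need, "loading" reduces to pointer arithmetic into the mapping, and pages are faulted in only when the tensors are actually used.

```cpp
#include <cstddef>

// Hypothetical types and names, for illustration only.
struct mapped_tensor {
    const void * data;   // points directly into the mmap()ed file
    size_t       nbytes; // no bytes are copied, or even read, until used
};

// `map_base` is the start of a read-only MAP_SHARED mapping of the model
// file; `offset` must satisfy the alignment the kernels require, which is
// exactly the file-format question to settle in #91.
mapped_tensor view_tensor(const char * map_base, size_t offset, size_t nbytes) {
    return { map_base + offset, nbytes };
}
```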
Thank you again for your contribution. I hope you'll keep working with us on the issue while we figure this out!