llama.cpp MMAP for Windows (not working atm)

The code in PR lets you run llama the first time, but the second time the program crashes. This is due to memory access violation when trying to access any data related to the model or vocab. I don't know why, but current implementation cannot serialize pointers in the magic properly. I suspect this is due to allocating them using new operator which generates addresses outside of mapped memory range, guess on the second run those pointers are simply not interpretable and point to nowhere.

Major changes:

custom malloc renamed to _malloc. Otherwise linker is complaining when linking against CRT as there are multiple definitions for malloc. I tried to do undef malloc to remove references for the original one, but did not verify yet if it makes any difference.
WinMap fix, set access flag to FILE_MAP_COPY when loading second time. See https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile.
Implement madvise for win. I can't verify if this change makes any difference as the pointers serialization is broken as of now, but it seems to be working.
Implement msync for win. Same comment as for madvise.

Some of the fixes to make it compile include:

#define NOMINMAX
using appropriate stat structure for win.
define MS_ASYNC for win.

This PR code will likely not compile on Linux right now (and it's also quite crappy as I did rapid changes just to make it compile at least). So if there are people using Visual Studio, I hope this PR will lay some foundation for further development.

Mar 20 '23 22:03 oKatanaaa

Also I think it's time to add commentaries for the code (mmap specifically), as it already becomes stupidly complicated and has a lot of magic constants. It simply becomes harder to contribute and maintain.

Mar 20 '23 22:03 oKatanaaa

... Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in https://github.com/ggerganov/llama.cpp/issues/278#issuecomment-1475652120

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

Mar 20 '23 22:03 anzz1

... Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

You mean something like importing https://github.com/alitrack/mman-win32 ?

Mar 20 '23 23:03 niclimcy

You mean something like importing https://github.com/alitrack/mman-win32 ?

Nope, quite the opposite, steering clear of any non-portable code, imported libraries or dependencies inside the main program, and have any functionality like this rather implemented as a module which would live outside the main program. Being easily separated from the main code and used (or not used) by adding a compiler flag and including the module's .cpp file in the project.

Explained in more detail in here: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1477080912

Mar 21 '23 00:03 anzz1

cmake script is broken, no longer detects msvc compiler

Mar 21 '23 01:03 niclimcy

... Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

@anzz1 This PR is not about multiprocessing or sharing memory but rather accelerating the loading of the model via a memory mapped file (see #91 for more details). Though I do agree with your point that all the fancy stuff should be kept outside of main to make everythin portable/maintainable.

Mar 21 '23 12:03 oKatanaaa

@nicknitewolf I looked into mman-win32 and tried to use the source code. Unfortunately it doesn't work. It is pretty similar to what @jart had already written but breaks without Justine's tweaks. I tried to adapt it to match the implementations, but no luck.

Sadly I'm out of options at this moment, not enough expertise for this kind of stuff. Unless there will be a person knowing mmap on win who can fix this, the only option is to fall back to default load on Windows.

Mar 21 '23 12:03 oKatanaaa

@anzz1 This PR is not about multiprocessing or sharing memory but rather accelerating the loading of the model via a memory mapped file (see #91 for more details). Though I do agree with your point that all the fancy stuff should be kept outside of main to make everythin portable/maintainable.

Yes, but the goal of accelerating the loading via a memory-mapped file is the ability to preload a model to memory, thus sharing it. Let's not argue about semantics :D

In any case, a C-style API is now being implemented: https://github.com/ggerganov/llama.cpp/pull/370 This will make implementing this way cleaner and easier as it can be implemented as a module and the default crt memory funcs do not need to be replaced. There are now llama_init_from_file and llama_free functions!

Mar 21 '23 21:03 anzz1

I have created my own library that implements mmap using mingw32 that makes this project maintainable for windows. It is possible to compile the program using library from https://github.com/CoderRC/libmingw32_extended, make changes like in https://github.com/ggerganov/llama.cpp/pull/564 and the specific make command below: make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'

Mar 28 '23 01:03 CoderRC

I've cherry-picked this pull request onto the pushed I've just made to the mmap branch. Merging this on GitHub is no longer necessary. Thank you for your contribution!

Mar 28 '23 17:03 jart