ggml : refactor ggml-cpu.c into multiple C++ source files
As per recent discussions (e.g. https://github.com/ggerganov/llama.cpp/pull/10144#pullrequestreview-2411814357), we should split the large ggml-cpu.c implementation into smaller modules - similar to how the CUDA backend is organized. We should utilize ~~C++11~~ C++ to reduce code duplication.
@ggerganov @slaren
Any thoughts on bumping C++11 to C++20?
I'd love to enable things like std::span (C++20), std::string_view (C++17), and other C++17 additions to the atomics and threading facilities (std::shared_mutex, std::hardware_destructive_interference_size, etc.), since they make the code a lot cleaner while keeping it efficient.
We could keep the public API headers C11/C++11 only and build ggml/llama.cpp internally with C++20 enabled, and maybe update the contributing guidelines to state explicitly that C++20 is enabled internally but only widely supported features are allowed.
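For illustration, here is roughly the kind of cleanup this buys (hypothetical helper, not actual ggml code):

```cpp
#include <cstddef>
#include <span>

// Hypothetical helper, shown only to illustrate the point.
// Raw pointer + length: the caller has to keep the two in sync by hand.
static float sum_f32(const float * data, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Same helper with std::span (C++20): the extent travels with the view and
// any contiguous container (std::vector, std::array, C array) converts
// implicitly, with no copies.
static float sum_f32(std::span<const float> data) {
    float acc = 0.0f;
    for (float v : data) {
        acc += v;
    }
    return acc;
}
```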
C++20 support is fairly ubiquitous at this point.
It's often listed as partial, but that is mostly due to modules and a few tricky features; all the core bits have been supported for a while now.
https://en.cppreference.com/w/cpp/compiler_support#C.2B.2B20_library_features
https://clang.llvm.org/cxx_status.html
https://gcc.gnu.org/projects/cxx-status.html
My opinion is that we should bump the standard only when we really need some functionality. For example, we are switching from C to C++ because the higher-level logic in ggml can really benefit from having dynamic containers. We will also try to utilize templates to reduce the code for the ops implementation as a function of the data types, since it is becoming very cumbersome and difficult to maintain large chunks of copy-pasted code. But overall, we should be mindful about the C++ that we use and try to keep things minimal. It's easy, for example, to slip in unnecessary memory allocations by using the wrong containers, or conversely to make the code too difficult to understand if templates are overused.
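As a rough sketch of the kind of template consolidation meant here (names and conversion helpers are illustrative only, not the actual ggml code):

```cpp
#include <cstdint>

// Illustrative per-type conversion helpers (stand-ins for the real ones).
static inline float to_f32(float  x) { return x; }
static inline float to_f32(int8_t x) { return (float) x; }

// One templated kernel replaces several copy-pasted per-type loops:
// the element type is a template parameter, the loop body stays identical.
template <typename T>
void vec_add_as_f32(int n, float * dst, const T * a, const T * b) {
    for (int i = 0; i < n; ++i) {
        dst[i] = to_f32(a[i]) + to_f32(b[i]);
    }
}

// Explicit instantiations keep the set of supported types visible in one place.
template void vec_add_as_f32<float>(int, float *, const float *, const float *);
template void vec_add_as_f32<int8_t>(int, float *, const int8_t *, const int8_t *);
```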
The std::span, std::string_view, etc. don't seem to be something that is really needed in the codebase atm, so I don't think it justifies bumping the standard for mostly QoL improvements. If we do, the code will likely become less consistent due to being able to do the same thing in different ways. Additionally, compile times tend to explode if we are not careful with the C++ features and headers that we use.
I think there are good reasons to bump the C++ standard, there are very useful features in C++14 and C++17 that would help make the code simpler. We already have problems due to inconsistency in the coding style and features used in different parts of the code (mostly in llama.cpp), and I don't expect that bumping the standard would make it significantly worse. Ultimately, this is a matter of enforcing a code style by writing a coding standard and just not merging code that does not follow it. You can already write very hard to read template code in just C++98, there is no need to bump the standard for that. Probably more so, since the lack of newer features forces the use of things like SFINAE.
However, I would expect that C++20 would break the build for a lot of people. Support is good in recent compiler versions, but many people are using very old environments, and telling them to just update their compiler is not always an option. We have workarounds to support very old versions of gcc because of this. Some people are using llama.cpp on very old systems and we should be careful about not breaking it for them. C++14 or 17 should be safer, though.
Sounds good. Looks like Diego and I are mostly on the same page already (C++17 as the default works for me). Once we start refactoring the CPU backend, I'll try to share some specific examples that benefit from the latest features, to revisit this and convince Georgi.
@max-krasnyansky I'm OK to bump to C++17 - @slaren's opinion is good enough for me. I only hope we will be vigilant and utilize just the relevant features for our needs.
c++11 is ancient. It's always hard dealing with older versions of c++
Just to add grist to the mill :wink:
> c++11 is ancient. It's always hard dealing with older versions of c++
For me the main "problem" with C++11 is that it was the first release with many new, interesting features. As with any first step, it was not "perfect". C++14 was a minor release that focused on fixing some of those "defects". So for me, C++14 is the minimal release to use.
Now for C++17/20... If I look at GCC (we may have to do the same check for Clang/Visual C++/...): https://gcc.gnu.org/projects/cxx-status.html
> C++20 Support in GCC: [...] Important: Because the ISO C++20 standard is very recent, GCC's support is experimental.
If I wanted to use some of these new features, they may not be completely stable and may require the latest compiler releases. If you use the latest Fedora or the next RHEL 10 and have to create something new, I would really think about using it. But for this project, with multi-target code/OS...
For C++17:
> C++17 mode is the default since GCC 11.
And all language features are implemented from GCC 8 onward. The oldest supported Linux OS is RHEL 8, which has GCC 8.4 as its default compiler (and devtoolsets that are much more recent). For the FP8 work I used C++17 and all checks passed on all targets. And there are nice elements (like `if constexpr` ...) that make some code simpler/more readable.
So if I had to vote, I would vote for C++17.
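To make the `if constexpr` point concrete (illustrative only, not actual ggml code):

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative only: with `if constexpr` (C++17) the per-type branching is
// resolved at compile time inside one template, instead of maintaining
// several near-identical functions or resorting to SFINAE.
template <typename T>
float load_as_f32(const T * x, int i) {
    if constexpr (std::is_same_v<T, float>) {
        return x[i];
    } else if constexpr (std::is_integral_v<T>) {
        return (float) x[i];
    } else {
        static_assert(std::is_floating_point_v<T>, "unsupported element type");
        return (float) x[i];
    }
}
```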
Yep. I think we're going with C++17 for now.
Though as I mentioned earlier, that warning about C++20 being new is primarily due to modules, which are not yet supported. I was mainly interested in std::span from C++20, which has been fully supported for a few years now.
> I was mainly interested in std::span from C++20 which has been fully supported for a few years now.
Some elements of std::span need C++23 and C++26 :wink:
Maybe look more at mdspan / mdarray (which will end up being available in C++23/26), which have a C++17/20 "backport": https://github.com/kokkos/mdspan
I'll have a look (for the FP8 support) at the changes for the CPU backend (ggml-cpu).
I'd like to make that support fit more nicely with this change (in a future PR?). So it looks like we split the "conversion" part (quantize_row_zzz / dequantize_row_zzz) from the "compute" part (ggml_vec_dot_zzz).
For now, these elements are still grouped together in the same file (ggml_quant). If I'm not wrong about this "issue", we might have to split it and move the ggml_vec_dot_zz into the backend?
Do you have any idea what structure you want? i.e., for example (a rough header sketch follows the list):
- create a ggml-cpu directory
- add vec_dot.h/.cpp in it
- move the ggml_vec_dot_* functions into it
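Something like this, just as an illustration (signatures simplified; the real ggml_vec_dot_* functions take additional stride/row-count parameters):

```cpp
// ggml-cpu/vec_dot.h -- illustrative layout only
#pragma once

#ifdef __cplusplus
extern "C" {
#endif

// dot product of two f32 vectors of length n, result written to *s
// (signature simplified for illustration)
void ggml_vec_dot_f32(int n, float * s, const float * x, const float * y);

#ifdef __cplusplus
}
#endif
```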
Next, for the C++ refactor: do you think we can use namespaces for "clean" naming? For example, use the ggml::cpu namespace for all the internal functions of this backend. This would lead to some renaming, for example: ggml_vec_dot_zz => ggml::cpu::vec_dot_...
I may try to split my FP8 support accordingly.
Last: do you think we will/can use C++ for ggml-cpu.c (i.e. ggml-cpu.cpp)?
> For now, these elements are still grouped together in the same file (ggml_quant). If I'm not wrong about this "issue", we might have to split it and move the ggml_vec_dot_zz into the backend?
This is something that I am working on at the moment. The goal is to separate the CPU backend completely from the core ggml functionality, and there is still quite a bit of work left to do to achieve this.
> For the C++ refactor, do you think we can use namespaces for "clean" naming?
I wouldn't be opposed to that, but I don't think it's critical either.
> Last: do you think we will/can use C++ for ggml-cpu.c (i.e. ggml-cpu.cpp)?
Yes, we only need to make the public interface compatible with C, the rest can be C++.
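For example, something along these lines (purely illustrative, not the actual code): the internals can live in a C++ namespace while the exported symbol keeps a C-compatible name and linkage.

```cpp
// Illustrative only: C++ internals behind a C-compatible entry point.
namespace ggml::cpu {

// internal implementation, free to use C++ (hypothetical name)
inline float vec_dot_f32(int n, const float * x, const float * y) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum += x[i]*y[i];
    }
    return sum;
}

} // namespace ggml::cpu

// the symbol exposed to C keeps its C name and linkage (signature simplified)
extern "C" void ggml_vec_dot_f32(int n, float * s, const float * x, const float * y) {
    *s = ggml::cpu::vec_dot_f32(n, x, y);
}
```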
@ggerganov Continuing the discussion from https://github.com/ggml-org/ggml/pull/1121#issuecomment-2681391467 (regarding code de-duplication).
Unless someone's already working on this, could I take up the following two steps?
- Submit a PR that only ~~renames~~ moves `ggml-cpu.c` to `ggml-cpu.cpp`, no other changes. Edit: moves the code to `ggml-cpu.cpp`, my apologies.
- A second PR that templatizes a few functions, e.g. the binbcast ones that were added recently in ggml in my PR.
PR 1 - The first step is purely so that future diffs are easier to read/review. Otherwise, combining the file move with a massive refactor would be confusing to read.
PR 2 - If the template style used for a few functions looks fine, then a lot more functions can be templatized progressively. I think a lot of functions are low-hanging fruit for function templates.
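As a rough illustration of the template style I have in mind (simplified, not the actual PR code; the real kernels also handle broadcasting and non-contiguous strides):

```cpp
// Illustrative helpers for the "binbcast"-style templatization idea.
inline float op_add(float a, float b) { return a + b; }
inline float op_mul(float a, float b) { return a * b; }

// The binary operation is a template parameter, so add/sub/mul/div can
// share one loop body instead of several copy-pasted functions.
template <float (*op)(float, float)>
void vec_binary_op_f32(int n, float * dst, const float * a, const float * b) {
    for (int i = 0; i < n; ++i) {
        dst[i] = op(a[i], b[i]);
    }
}

// usage: one instantiation per operation
// vec_binary_op_f32<op_add>(n, dst, a, b);
// vec_binary_op_f32<op_mul>(n, dst, a, b);
```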
I'm happy to take these two up, unless someone's already working on this.
> - Submit a PR that only renames `ggml-cpu.c` to `ggml-cpu.cpp`, no other changes.
Sounds good, but be aware that a lot of code will likely need to be updated to work with the stricter C++ type system. A lot of the code in this file is platform-specific, and checking that nothing is broken will take some work.
@slaren Alternatively, I could move a bunch of the operator functions to the existing ggml-cpu.cpp file and templatize them - not all at once, but progressively, while leaving the rest of ggml-cpu.c as it is.
This way, I can avoid the problems with a stricter type system, while still helping with reducing the size of the C file.
PS: Sorry, not sure how I missed that there's already a ggml-cpu.cpp file (while posting my earlier message)! :)
Yes, that would also work.
Part 1 (of a few more PRs), which moves out some of the low-hanging fruit - https://github.com/ggml-org/ggml/pull/1144 . This PR shaves about 2000 lines off ggml-cpu.c.
Part 2, which moves all of the operators (except mul_mat), the SIMD mappings, and the vectorized functions into separate files - https://github.com/ggml-org/ggml/pull/1167 . This PR shrinks ggml-cpu.c to around 3500 lines (down from 15k before the first PR).
// no hurry, just mentioning it here for tracking on the roadmap ticket