[RFC] Add a header for PyTorch-like operator overloading syntax
Coming from the original PyTorch implementations, people are finding it increasingly cumbersome to type ggml_ and ctx over and over again. One line of Python can turn into 20 lines of C++. This is creating too much friction, and we are getting lost in the boilerplate instead of being able to see the big picture.
I would like to create a header that takes advantage of C++ operator overloading. Eventually, it will include PyTorch and NumPy aliases to allow simply copying and pasting code from Python into ggml C++ with only minor fixups.
The new struct will just wrap things like `struct ggml_tensor_wrapper { ggml_tensor * data; ggml_context * ctx; };`. The goal is to not change the resulting binary: the header will be header-only, with everything declared inline, ultimately still calling the ggml_ series of functions.
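For illustration, a minimal sketch of what such a header could look like; the methods and operator shown are just a sample, and none of the names are final:

```cpp
// sketch only: header-only, everything inline, zero-cost forwarding to the C API
#pragma once
#include "ggml.h"

struct ggml_tensor_wrapper {
    ggml_tensor  * data;
    ggml_context * ctx;

    // implicit cast back, so wrapped tensors keep working with plain ggml_* calls
    operator ggml_tensor *() const { return data; }

    // each method is a one-line inline forwarder to an existing ggml_ function
    ggml_tensor_wrapper norm(float eps) const { return {ggml_norm(ctx, data, eps), ctx}; }
    ggml_tensor_wrapper gelu() const          { return {ggml_gelu(ctx, data), ctx}; }
};

// binary operators are inline free functions; mixed raw/wrapped operands use the wrapper's ctx
inline ggml_tensor_wrapper operator*(ggml_tensor_wrapper a, ggml_tensor * b) {
    return {ggml_mul(a.ctx, a.data, b), a.ctx};
}
```

The remaining methods and operators would follow the same one-line forwarding pattern.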
Example 1
Before
https://github.com/ggml-org/whisper.cpp/blob/5527454cdb3e15d7e2b8a6e2afcb58cb61651fd2/src/whisper.cpp#L2231-L2259
```cpp
// feed-forward network
{
    // norm
    {
        cur = ggml_norm(ctx0, inpFF, hparams.eps);

        // cur = mlp_ln_w*cur + mlp_ln_b
        cur = ggml_add(ctx0,
                ggml_mul(ctx0, cur, layer.mlp_ln_w),
                layer.mlp_ln_b);
    }

    // fully connected
    cur = ggml_mul_mat(ctx0,
            layer.mlp_0_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_0_b);

    // GELU activation
    cur = ggml_gelu(ctx0, cur);

    // projection
    cur = ggml_mul_mat(ctx0,
            layer.mlp_1_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_1_b);
}
```
After
```cpp
// feed-forward network
{
    // norm
    {
        cur = inpFF.norm(hparams.eps);
        cur = cur * layer.mlp_ln_w + layer.mlp_ln_b;
    }

    cur = (layer.mlp_0_w ^ cur) + layer.mlp_0_b; // fully connected
    cur = cur.gelu();
    cur = (layer.mlp_1_w ^ cur) + layer.mlp_1_b; // projection
}
```
Example 2
Before
https://github.com/ggml-org/whisper.cpp/blob/5527454cdb3e15d7e2b8a6e2afcb58cb61651fd2/src/whisper.cpp#L2748-L2759
```cpp
struct ggml_tensor * aheads_KQs = ggml_reshape_2d(ctx0, KQ_soft_max, KQ_soft_max->ne[0] * KQ_soft_max->ne[1], KQ_soft_max->ne[2]);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_mul_mat(ctx0, wstate.aheads_masks.m[il], aheads_KQs);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_reshape_3d(ctx0, aheads_KQs, KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);

if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = ggml_concat(ctx0, aheads_cross_QKs, aheads_KQs, 2);
}
```
After
```cpp
// typedef ggml_tensor_wrapper gg
// .flatten is from PyTorch
// .T is from NumPy. For convenience, tensor.T() = tensor.transpose().cont()
gg aheads_KQs{KQ_soft_max.flatten(0, 1).T()};
aheads_KQs = (wstate.aheads_masks.m[il] ^ aheads_KQs).T();
aheads_KQs = aheads_KQs.reshape(KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);

if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = aheads_cross_QKs.concat(aheads_KQs, 2);
}
```
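For reference, both aliases could be one-line inline members of the wrapper sketched earlier; `.flatten` below only handles the dim-0/1 merge this example needs:

```cpp
// two more inline members of ggml_tensor_wrapper, sketched to match the comments above

// tensor.T() = tensor.transpose().cont()
ggml_tensor_wrapper T() const {
    return {ggml_cont(ctx, ggml_transpose(ctx, data)), ctx};
}

// flatten(0, 1) merges the first two dims, matching the ggml_reshape_2d
// call in the "before" version; other start/end combinations omitted
ggml_tensor_wrapper flatten(int start, int end) const {
    GGML_ASSERT(start == 0 && end == 1);
    return {ggml_reshape_2d(ctx, data, data->ne[0] * data->ne[1], data->ne[2]), ctx};
}
```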
Example 3
I am drowning in boilerplate at https://github.com/mmwillet/TTS.cpp/blob/0b420102d53c16f36ea75e626a3a3d40d7b26a4d/src/kokoro_model.cpp#L1141.
I experimented a bit with this in #581.
Agreed about the boilerplate, and the need for this. +1 if this can be kept as lightweight as possible.
Alternative approaches, included just for the sake of completeness in the discussion (born from the same need):

- `stable-diffusion.cpp` defines a class-based system, loosely mimicking PyTorch's `nn.Module` class: https://github.com/leejet/stable-diffusion.cpp/blob/5900ef6605c6fbf7934239f795c13c97bc993853/ggml_extend.hpp#L1458
- I was experimenting with an even higher-level system that mimics PyTorch's `nn.Module` approach even more faithfully, adds operator overloading, and automatically handles the creation and allocation of the tensors necessary for the module, auto-creation of backends, etc. So you could almost translate PyTorch modules line-by-line. But this isn't lightweight; it's pretty much a framework.
I know these two go in a very different direction from the one you're suggesting, and I like what you're suggesting.
@danielzgtg What would a basic implementation look like for a simple 2-parameter model? I mean the actual tensor_wrap struct and operator overloading functions. Thanks!
For example: https://github.com/cmdr2/study/blob/161f4f5741017e206c932b6c2ca83e27d24af295/ml/ggml-test/logic_gate.cpp#L24
```cpp
struct logic_gate_model {
    ggml_tensor* fc1_weight;
    ggml_tensor* fc1_bias;
    ggml_tensor* fc2_weight;
    ggml_tensor* fc2_bias;
    ggml_context* params_ctx;
};
```
With the computation graph:
```cpp
ggml_tensor* fc1 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc1_weight, x), model.fc1_bias); // multiply the weights, and add the bias
ggml_tensor* fc1_relu = ggml_relu(ctx, fc1);
ggml_tensor* fc2 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc2_weight, fc1_relu), model.fc2_bias);
ggml_tensor* result = ggml_hardsigmoid(ctx, fc2);
```
@cmdr2 That would be:
```cpp
gg fc1 = ((model.fc1_weight ^ gg{ctx, x}) + model.fc1_bias).relu();
gg fc2 = ((model.fc2_weight ^ fc1) + model.fc2_bias).hardsigmoid();
ggml_tensor * result = fc2; // implicit cast
```
`^` is used here because C++ can't overload `@` from NumPy, but I still need its brevity. There's no conflict with exponentiation: base-e will be `.exp()`, and other bases aren't used in machine learning.
I think that ggerganov wants to limit the scope of this (https://github.com/ggml-org/ggml/pull/581#pullrequestreview-1678714576), so nn.Module is too big of a change/addition. Accordingly, if I remove the context stack from slaren's draft and focus on just the operator overloading, I hope this might have a chance of getting accepted.
@danielzgtg I agree, nn.Module-like systems are probably best left as separate frameworks for ggml, rather than being shoehorned into ggml itself.
What I meant was to ask how the implementation of that syntax would go. For example: a basic implementation of the relu() function, the + operator overload, and the tensor_wrap struct.
The intent was to help flesh out your proposal more. Thanks!
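For illustration, a self-contained sketch of those three pieces; the constructor argument order follows the `gg{ctx, x}` usage above, and nothing here is a final API:

```cpp
#include "ggml.h"

// standing in for the ggml_tensor_wrapper typedef'd to gg earlier in the thread
struct gg {
    ggml_tensor  * data;
    ggml_context * ctx;

    gg(ggml_context * c, ggml_tensor * t) : data(t), ctx(c) {}
    operator ggml_tensor *() const { return data; }  // implicit cast back to the C API

    // activations: inline one-liners over the existing ggml_ calls
    gg relu()        const { return {ctx, ggml_relu(ctx, data)}; }
    gg hardsigmoid() const { return {ctx, ggml_hardsigmoid(ctx, data)}; }
};

// mixed raw/wrapped operands borrow the wrapper's context
inline gg operator+(gg a, ggml_tensor * b) { return {a.ctx, ggml_add(a.ctx, a.data, b)}; }

// ^ stands in for NumPy's @ (matrix multiply), which C++ cannot overload
inline gg operator^(ggml_tensor * w, gg x) { return {x.ctx, ggml_mul_mat(x.ctx, w, x.data)}; }
```

With that in place, the two-parameter model above builds its graph exactly as in the earlier snippet.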
When I started using ggml I was itching to do the same thing (and I enjoy doing this kind of work too). I've had the opposite experience over time though. Adding a handful of functions and passing some additional info around along with the ggml_context turned out to be good enough.
To illustrate, here is how I would write Example 1:
```cpp
// reusable implementation of torch.nn.Linear
tensor linear(context c, tensor x) {
    x = ggml_mul_mat(c, c.get("weight"), x);
    if (tensor bias = c.find("bias")) {
        x = ggml_add_inplace(c, x, bias);
    }
    return x;
}

cur = ggml_norm(c, inpFF, hparams.eps);
cur = ggml_mul(c, cur, c.get("mlp.ln.weight"));
cur = ggml_add(c, cur, c.get("mlp.ln.bias"));
cur = linear(c["mlp.0"], cur);
cur = ggml_gelu(c, cur);
cur = linear(c["mlp.1"], cur);
```
Every nn.Module becomes a simple function, usually with no more lines or bloat than the reference forward function in Python. Yes, you have to pass in the context and write a bit of ggml_, instead of passing self. and occasionally torch.functional.
There's no need to create structs or classes that resemble PyTorch modules. I keep track of the current prefix for weight names, so it's easy to do the lookup while building the graph. (If it becomes a bottleneck it can be optimized, but unless you are rebuilding the graph all the time, that's unlikely.)
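For concreteness, the `context` aggregate used above could be as small as the following sketch (one possible shape; the actual implementation is not shown in this thread):

```cpp
#include <map>
#include <string>
#include "ggml.h"

using tensor = ggml_tensor *;

struct context {
    ggml_context * ctx;                                  // graph-building context
    const std::map<std::string, tensor> * weights;       // model weights by full name
    std::string prefix;                                  // current module prefix, e.g. "mlp.0"

    operator ggml_context *() const { return ctx; }      // pass straight into ggml_* calls

    tensor find(const std::string & name) const {        // nullptr if absent (optional bias)
        auto it = weights->find(prefix.empty() ? name : prefix + "." + name);
        return it == weights->end() ? nullptr : it->second;
    }
    tensor get(const std::string & name) const {         // hard failure if absent
        tensor t = find(name);
        GGML_ASSERT(t != nullptr);
        return t;
    }
    context operator[](const std::string & sub) const {  // descend into a submodule
        return {ctx, weights, prefix.empty() ? sub : prefix + "." + sub};
    }
};
```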
I do think stable-diffusion.cpp is a really cool project, but I look at the code and see virtual inheritance... shared_ptr... dynamic_cast...? Why? It does seem boilerplate-y, but not primarily because of ggml, IMO.
Now, I don't want to argue that you have to do it like me or that it's the best way, only that functions and aggregates with a typedef or two can already substantially reduce boilerplate. The advantage is that it's an addition to the existing ggml API, rather than a replacement. Wrapping a substantial amount of the API is more effort, and it also creates a certain split: what is then the "canonical" way to use ggml? Some code will use the C++ API, some the C API, and the same goes for examples, tutorials, documentation...
You don't get math operators that way, and I admit they would be neat. But generally I like the one-line-per-op style. In Python I might be looking for a concise mathematical form for research. In ggml I'm interested in performance, and not too keen to hide a ggml_cont in something like .T(). If there's really a situation where I have many simple math operations in one place, it should scream "fuse me".
Finally, as an example of "reduce boilerplate by adding a new function", here is a PatchMerging implementation:
```cpp
tensor patch_merging(context m, tensor x, int64_t w, int64_t h) {
    auto [c, n, b, _] = nelements(x);
    ASSERT(n == w * h, "Spatial dimensions do not match");
    x = ggml_reshape_4d(m, x, c, w, h, b);
    x = concat(m, {
        slice(m, x, {}, {0, w, 2}, {0, h, 2}, {}),      // x[:, 0:w:2, 0:h:2, :]
        slice(m, x, {}, {0, w, 2}, {1, h, 2}, {}),      // x[:, 0:w:2, 1:h:2, :]
        slice(m, x, {}, {1, w, 2}, {0, h, 2}, {}),      // x[:, 1:w:2, 0:h:2, :]
        slice(m, x, {}, {1, w, 2}, {1, h, 2}, {})}, 0); // x[:, 1:w:2, 1:h:2, :]
    x = ggml_reshape_3d(m, x, c * 4, n / 4, b);
    x = layer_norm(m["norm"], x);
    x = linear(m["reduction"], x);
    return x;
}
```
Do I want to write it with ggml_view? Definitely not. But is it so important whether it's slice(m, x, ...) or x.slice(...)?
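For reference, the small helpers used in patch_merging are not shown in the thread; `nelements` and `slice` might look roughly like this, assuming `{start, end, step}` triples where `{}` means the whole dimension, and ignoring any ggml_cont that downstream ops may require of strided views:

```cpp
#include <array>
#include "ggml.h"

using tensor = ggml_tensor *;

// expose a tensor's dims for structured bindings: auto [c, n, b, _] = nelements(x);
inline std::array<int64_t, 4> nelements(tensor x) {
    return {x->ne[0], x->ne[1], x->ne[2], x->ne[3]};
}

struct range { int64_t start = 0, end = -1, step = 1; };  // {} = the whole dimension

// python-style x[r0, r1, r2, r3] as a strided ggml_view (no copy is made)
inline tensor slice(ggml_context * m, tensor x, range r0, range r1, range r2, range r3) {
    range r[4] = {r0, r1, r2, r3};
    // a step on dim 0 would change nb0, which ggml_view_4d cannot express
    GGML_ASSERT(r[0].step == 1);
    int64_t ne[4];
    size_t  nb[4], offset = 0;
    for (int i = 0; i < 4; ++i) {
        const int64_t end = r[i].end < 0 ? x->ne[i] : r[i].end;
        ne[i]   = (end - r[i].start + r[i].step - 1) / r[i].step;  // ceil-div for odd sizes
        nb[i]   = x->nb[i] * r[i].step;
        offset += r[i].start * x->nb[i];
    }
    return ggml_view_4d(m, x, ne[0], ne[1], ne[2], ne[3], nb[1], nb[2], nb[3], offset);
}
```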