
[RFC] Add a header for PyTorch-like operator overloading syntax

Open danielzgtg opened this issue 3 months ago • 6 comments

When porting models from their original PyTorch implementations, people are finding it increasingly cumbersome to type ggml_ and ctx over and over again. One line of Python can turn into 20 lines of C++. This creates too much friction, and we get lost in the boilerplate instead of being able to see the big picture.

I would like to create a header that takes advantage of C++ operator overloading. Eventually, it will include PyTorch and NumPy aliases so that code can be copied and pasted from Python into ggml C++ with only minor fixups.

The new struct will just be a thin wrapper along the lines of struct ggml_tensor_wrapper { ggml_tensor * data; ggml_context * ctx; }; the goal is to not change the resulting compiled binary. This will be done by making the header header-only and declaring everything inline, ultimately still calling the ggml_ series of functions.
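
A rough sketch of what this header could look like (illustrative only; the names and the exact set of operators are not final):

#include "ggml.h"

struct ggml_tensor_wrapper {
    ggml_tensor  * data;
    ggml_context * ctx;

    // pass the wrapper anywhere a plain ggml_tensor * is expected
    operator ggml_tensor *() const { return data; }

    ggml_tensor_wrapper wrap(ggml_tensor * t) const { return {t, ctx}; }

    ggml_tensor_wrapper norm(float eps) const { return wrap(ggml_norm(ctx, data, eps)); }
    ggml_tensor_wrapper gelu() const          { return wrap(ggml_gelu(ctx, data)); }
};

inline ggml_tensor_wrapper operator+(ggml_tensor_wrapper a, ggml_tensor * b) {
    return a.wrap(ggml_add(a.ctx, a.data, b));
}

inline ggml_tensor_wrapper operator*(ggml_tensor_wrapper a, ggml_tensor * b) {
    return a.wrap(ggml_mul(a.ctx, a.data, b));
}

// '^' stands in for Python's '@' (matrix multiplication); it binds looser
// than '+', hence the parentheses in the examples below
inline ggml_tensor_wrapper operator^(ggml_tensor * a, ggml_tensor_wrapper b) {
    return b.wrap(ggml_mul_mat(b.ctx, a, b.data));
}

typedef ggml_tensor_wrapper gg;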

Example 1

Before

https://github.com/ggml-org/whisper.cpp/blob/5527454cdb3e15d7e2b8a6e2afcb58cb61651fd2/src/whisper.cpp#L2231-L2259

// feed-forward network
{
    // norm
    {
        cur = ggml_norm(ctx0, inpFF, hparams.eps);

        // cur = mlp_ln_w*cur + mlp_ln_b
        cur = ggml_add(ctx0,
                ggml_mul(ctx0, cur, layer.mlp_ln_w),
                layer.mlp_ln_b);
    }

    // fully connected
    cur = ggml_mul_mat(ctx0,
            layer.mlp_0_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_0_b);

    // GELU activation
    cur = ggml_gelu(ctx0, cur);

    // projection
    cur = ggml_mul_mat(ctx0,
            layer.mlp_1_w,
            cur);

    cur = ggml_add(ctx0, cur, layer.mlp_1_b);
}

After

// feed-forward network
{
    // norm
    {
        cur = inpFF.norm(hparams.eps);
        cur = cur * layer.mlp_ln_w + layer.mlp_ln_b;
    }

    cur = (layer.mlp_0_w ^ cur) + layer.mlp_0_b; // fully connected
    cur = cur.gelu();
    cur = (layer.mlp_1_w ^ cur) + layer.mlp_1_b; // projection
}

Example 2

Before

https://github.com/ggml-org/whisper.cpp/blob/5527454cdb3e15d7e2b8a6e2afcb58cb61651fd2/src/whisper.cpp#L2748-L2759

struct ggml_tensor * aheads_KQs = ggml_reshape_2d(ctx0, KQ_soft_max, KQ_soft_max->ne[0] * KQ_soft_max->ne[1], KQ_soft_max->ne[2]);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_mul_mat(ctx0, wstate.aheads_masks.m[il], aheads_KQs);
aheads_KQs = ggml_transpose(ctx0, aheads_KQs);
aheads_KQs = ggml_cont(ctx0, aheads_KQs);
aheads_KQs = ggml_reshape_3d(ctx0, aheads_KQs, KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);
if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = ggml_concat(ctx0, aheads_cross_QKs, aheads_KQs, 2);
}

After

// typedef ggml_tensor_wrapper gg
// .flatten is from PyTorch
// .T is from numpy. For convenience, tensor.T() = tensor.transpose().cont()
gg aheads_KQs{KQ_soft_max.flatten(0, 1).T()};
aheads_KQs = (wstate.aheads_masks.m[il] ^ aheads_KQs).T();
aheads_KQs = aheads_KQs.reshape(KQ_soft_max->ne[0], KQ_soft_max->ne[1], wstate.aheads_masks.m[il]->ne[1]);
if (aheads_cross_QKs == NULL) {
    aheads_cross_QKs = aheads_KQs;
} else {
    aheads_cross_QKs = aheads_cross_QKs.concat(aheads_KQs, 2);
}

Example 3

I am drowning in boilerplate at https://github.com/mmwillet/TTS.cpp/blob/0b420102d53c16f36ea75e626a3a3d40d7b26a4d/src/kokoro_model.cpp#L1141 .

danielzgtg • Aug 13 '25

I experimented a bit with this in #581.

slaren • Aug 13 '25

Agreed about the boilerplate, and the need for this. +1 if this can be kept as lightweight as possible.

Alternative approaches, included just for the sake of completeness in the discussion (born from the same need):

  • stable-diffusion.cpp defines a class-based system, loosely mimicking PyTorch's nn.Module class - https://github.com/leejet/stable-diffusion.cpp/blob/5900ef6605c6fbf7934239f795c13c97bc993853/ggml_extend.hpp#L1458
  • I was experimenting with an even higher-level system that mimics PyTorch's nn.Module approach even more faithfully, adds operator overloading, and automatically handles the creation and allocation of the tensors a module needs, auto-creation of backends, etc. So you could almost translate PyTorch modules line by line. But this isn't lightweight; it's pretty much a framework (a rough sketch of this general direction follows below).
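
Purely to illustrate that direction (this is not the actual code of stable-diffusion.cpp or of my experiment, just a hypothetical shape of it):

#include "ggml.h"

struct Linear {
    ggml_tensor * weight = nullptr;
    ggml_tensor * bias   = nullptr;

    // a real framework would also register parameter names and handle
    // allocation, loading and backend selection behind the scenes
    void init(ggml_context * params_ctx, int64_t n_in, int64_t n_out) {
        weight = ggml_new_tensor_2d(params_ctx, GGML_TYPE_F32, n_in, n_out);
        bias   = ggml_new_tensor_1d(params_ctx, GGML_TYPE_F32, n_out);
    }

    ggml_tensor * forward(ggml_context * ctx, ggml_tensor * x) const {
        return ggml_add(ctx, ggml_mul_mat(ctx, weight, x), bias);
    }
};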

I know these two go in a very different direction from what you're suggesting, and I like what you're suggesting.

cmdr2 • Aug 14 '25

@danielzgtg What would a basic implementation look like for a simple 2-parameter model? I mean the actual tensor_wrap struct and operator overloading functions. Thanks!

For example: https://github.com/cmdr2/study/blob/161f4f5741017e206c932b6c2ca83e27d24af295/ml/ggml-test/logic_gate.cpp#L24

struct logic_gate_model {
    ggml_tensor* fc1_weight;
    ggml_tensor* fc1_bias;
    ggml_tensor* fc2_weight;
    ggml_tensor* fc2_bias;
    ggml_context* params_ctx;
};

With the computation graph:

ggml_tensor* fc1 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc1_weight, x), model.fc1_bias);  // multiply the weights, and add the bias
ggml_tensor* fc1_relu = ggml_relu(ctx, fc1);
ggml_tensor* fc2 = ggml_add(ctx, ggml_mul_mat(ctx, model.fc2_weight, fc1_relu), model.fc2_bias);
ggml_tensor* result = ggml_hardsigmoid(ctx, fc2);

cmdr2 • Aug 14 '25

@cmdr2 That would be:

gg fc1 = ((model.fc1_weight ^ gg{ctx, x}) + model.fc1_bias).relu();
gg fc2 = ((model.fc2_weight ^ fc1) + model.fc2_bias).hardsigmoid();
ggml_tensor * result = fc2; // implicit cast

^ is used here because C++ has no @ operator to borrow from NumPy, but matrix multiplication still needs that brevity. Repurposing ^ is fine because exponentiation doesn't need an operator: base-e will be .exp(), and other bases aren't used in machine learning.

I think that ggerganov wants to limit the scope of this: https://github.com/ggml-org/ggml/pull/581#pullrequestreview-1678714576 , so nn.Module is too big a change/addition. Accordingly, if I remove the context stack from slaren's draft and focus on just the operator overloading, I hope this might have a chance of getting accepted.

danielzgtg • Aug 14 '25

@danielzgtg I agree, nn.Module-like systems are probably best left as separate frameworks built on top of ggml, rather than shoehorned into ggml itself.

What I meant was to ask how the implementation of that syntax would go, for example a basic implementation of the relu() function, the + operator overload, and the tensor_wrap struct.

The intent was to help flesh out your proposal more. Thanks!

cmdr2 • Aug 14 '25

When I started using ggml I was itching to do the same thing (and I enjoy doing this kind of work too). I've had the opposite experience over time though. Adding a handful of functions and passing some additional info around along with the ggml_context turned out to be good enough.

To illustrate, here is how I would write Example 1:

// reusable implementation of torch.nn.Linear
tensor linear(context c, tensor x) {
    x = ggml_mul_mat(c, c.get("weight"), x);
    if (tensor bias = c.find("bias")) {
        x = ggml_add_inplace(c, x, bias);
    }
    return x;
}

cur = ggml_norm(c, inpFF, hparams.eps);
cur = ggml_mul(c, cur, c.get("mlp.ln.weight"));
cur = ggml_add(c, cur, c.get("mlp.ln.bias"));
cur = linear(c["mlp.0"], cur);
cur = ggml_gelu(c, cur);
cur = linear(c["mlp.1"], cur);

Every nn.Module becomes a simple function, usually with no more lines or bloat than the reference forward function in Python. Yes, you have to pass in the context and write a bit of ggml_, instead of passing self. and occasionally torch.functional.
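
For instance, the layer_norm used in the PatchMerging example further down follows the same pattern (a sketch, assuming the usual PyTorch parameter names "weight" and "bias" and the default eps of 1e-5):

// torch.nn.LayerNorm as a plain function, same style as linear() above
tensor layer_norm(context c, tensor x, float eps = 1e-5f) {
    x = ggml_norm(c, x, eps);
    x = ggml_mul(c, x, c.get("weight"));
    x = ggml_add(c, x, c.get("bias"));
    return x;
}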

There's no need to create structs or classes which resemble PyTorch modules. I keep track of the current prefix for weight names so it's easy to do the lookup while building the graph. (If it becomes a bottleneck it can be optimized, but unless you are rebuilding the graph all the time it's unlikely to matter.)
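
A rough sketch of what that context type can look like (a simplified illustration, not my exact code):

#include <map>
#include <string>
#include "ggml.h"

typedef ggml_tensor * tensor;

// a ggml_context plus the current weight-name prefix and a weight lookup
struct context {
    ggml_context * ctx = nullptr;
    std::string prefix;                                   // e.g. "mlp.0."
    const std::map<std::string, ggml_tensor *> * weights = nullptr;

    // implicit conversion so ggml_* functions accept the wrapper directly
    operator ggml_context *() const { return ctx; }

    // enter a sub-module: c["mlp.0"] appends to the prefix
    context operator[](const std::string & name) const {
        return {ctx, prefix + name + ".", weights};
    }

    tensor find(const std::string & name) const {
        auto it = weights->find(prefix + name);
        return it != weights->end() ? it->second : nullptr;
    }

    tensor get(const std::string & name) const {
        tensor t = find(name);
        GGML_ASSERT(t != nullptr);
        return t;
    }
};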

I do think stable-diffusion.cpp is a really cool project, but I look at the code and see virtual inheritance... shared_ptr... dynamic_cast...? Why? It does seem boilerplate-heavy, but not primarily because of ggml, imo.

Now I don't want to argue that you have to do it like me or that it's the best way, only that functions and aggregates with a typedef or two can already substantially reduce boilerplate. The advantage is that it's an addition to the existing ggml API rather than a replacement. Wrapping a substantial amount of the API is more effort, and it also creates a certain split: what is then the "canonical" way to use ggml? Some code will use the C++ API, some the C API, and the same goes for examples, tutorials, documentation...

You don't get math operators that way, and I admit they would be neat. But generally I like the one-line-per-op style. In Python I might be looking for a concise mathematical form for research. In ggml I'm interested in performance, and I'm not too keen to hide a ggml_cont in something like .T(). If there's really a situation where I have many simple math operations in one place, it should scream "fuse me".

Finally, as an example of "reduce boilerplate by adding a new function", here is a PatchMerging implementation:

tensor patch_merging(context m, tensor x, int64_t w, int64_t h) {
    auto [c, n, b, _] = nelements(x);
    ASSERT(n == w * h, "Spatial dimensions do not match");

    x = ggml_reshape_4d(m, x, c, w, h, b);
    x = concat(m, {
        slice(m, x, {}, {0, w, 2}, {0, h, 2}, {}),       // x[:, 0:w:2, 0:h:2, :]
        slice(m, x, {}, {0, w, 2}, {1, h, 2}, {}),       // x[:, 0:w:2, 1:h:2, :]
        slice(m, x, {}, {1, w, 2}, {0, h, 2}, {}),       // x[:, 1:w:2, 0:h:2, :]
        slice(m, x, {}, {1, w, 2}, {1, h, 2}, {})}, 0);  // x[:, 1:w:2, 1:h:2, :]
    x = ggml_reshape_3d(m, x, c * 4, n / 4, b);

    x = layer_norm(m["norm"], x);
    x = linear(m["reduction"], x);
    return x;
}
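
For reference, the nelements helper used above is just a tiny aggregate so the shape can be unpacked with structured bindings (a sketch; slice and concat are similarly thin wrappers, presumably over ggml_view and ggml_concat):

struct shape4 { int64_t ne0, ne1, ne2, ne3; };

inline shape4 nelements(const ggml_tensor * t) {
    return {t->ne[0], t->ne[1], t->ne[2], t->ne[3]};
}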

Do I want to write it with ggml_view? Definitely not. But is it so important whether it's slice(m, x, ...) or x.slice(...)?

Acly • Aug 17 '25