WIP: LTX-Video support
For now, the diffusion model seems to load in memory. The 128-D VAE is still completely unimplemented. The forward logic might be off.
TODO:
- [x] rebase on master properly
- [x] Figure out how to work around `tensor 'model.diffusion_model.proj_out.weight' has wrong shape in model file: got [1, 1, 2048, 128], expected [2048, 128, 1, 1]` (the diffusion model will hopefully load properly after that)
- [ ] VAE support
- [ ] Figure out modulation order
- [ ] Make it generate video
- [ ] Use it to implement Pixart Alpha too
This is great, thank you so much. Can't wait to test it when it's done.
Please support quantized LTX models and conversion too; there are FP8 models on Hugging Face, and FP16 GGUF files are there as well.
Please add support for CPU users too.
It would be great if you could make img2vid work.
Also, in the LTX-Video playground on Hugging Face there are advanced options that allow videos up to 11 seconds, e.g. 512x320 resolution with 257 frames. It would be great if we could make long videos here too.
Conversion/quantization should be working already.
The 5D tensors in the VAE are a pain to deal with. I'm losing motivation...
It's OK, this is a hard task. We have non-working SVD too; it seems video is harder to implement in sd.cpp. Thank you for your hard work. Let me summon the LTX staff themselves: @yoavhacohen, could you give us a hand here please? We seem to be having a hard time bringing video to cpp.
@stduhpf Which operators require more than 4-dimensional tensors? Can these tensors be transformed to fewer dimensions? Maybe it can be done with an appropriate combination of ggml_view and ggml_reshape. We had similar issues when implementing SAM, and I remember we managed to avoid the 5D tensors that were used in the original implementation. Hopefully there is a workaround, since adding support for 5 and more dimensions to ggml would be very difficult.
@ggerganov Basically the whole VAE is made of 3D convolutions, so this means a 3x3x3 kernel for each input/output channel pair. Maybe there is a way to flatten it to use Conv2d instead, but I couldn't figure it out.
Hm, indeed it's not obvious. I guess we will need to increase the GGML_MAX_DIMS at some point.
It’s understandable that this task is challenging, and I appreciate everyone’s efforts so far. Based on the comments, the issue seems to stem from the lack of a conv3d implementation in the GGML library.
Although I’m not familiar with GGML, I noticed that conv2d is implemented using im2col and matmul:
https://github.com/ggerganov/ggml/blob/a5960e80d3e65ce6ff18f90315ab96f63cf9c4cc/src/ggml.c#L3884
The same principle can be extended to conv3d using a 3D version of im2col. Here’s a high-level approach:
Implementing conv3d:
You can create an im2col_3d tensor and perform matrix multiplication for convolution, similar to the conv2d implementation. Below is sample (untested) code:
```c
// a:      [OC, IC, KD, KH, KW]
// b:      [N, IC, ID, IH, IW]
// result: [N, OC, OD, OH, OW]
struct ggml_tensor * ggml_conv_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0,   // stride depth
        int s1,   // stride height
        int s2,   // stride width
        int p0,   // padding depth
        int p1,   // padding height
        int p2,   // padding width
        int d0,   // dilation depth
        int d1,   // dilation height
        int d2) { // dilation width
    // Create the im2col tensor for the 3D input
    struct ggml_tensor * im2col = ggml_im2col_3d(ctx, a, b, s0, s1, s2, p0, p1, p2, d0, d1, d2, a->type); // [N, OD, OH, OW, IC * KD * KH * KW]

    // Perform the convolution as a matrix multiplication
    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[4] * im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OD, OH, OW, IC * KD * KH * KW] => [N*OD*OH*OW, IC * KD * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2] * a->ne[3]), a->ne[4]));                            // [OC, IC, KD, KH, KW] => [OC, IC * KD * KH * KW]

    // Reshape the result back to a 5D tensor
    result = ggml_reshape_5d(ctx, result, im2col->ne[1], im2col->ne[2], im2col->ne[3], im2col->ne[4], a->ne[4]); // [OC, N, OD, OH, OW]
    result = ggml_cont(ctx, ggml_permute(ctx, result, 0, 1, 2, 4, 3)); // swap the N and OC axes: [N, OC, OD, OH, OW]
    return result;
}
```
Implementing im2col_3d:
Since GGML lacks im2col_3d, you can emulate it using a composition of two im2col operations:
- Step 1: Apply 1D im2col along the depth dimension.
- Step 2: Apply 2D im2col over the height and width dimensions.
```c
struct ggml_tensor * ggml_im2col_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, // stride depth
        int s1, // stride height
        int s2, // stride width
        int p0, // padding depth
        int p1, // padding height
        int p2, // padding width
        int d0, // dilation depth
        int d1, // dilation height
        int d2, // dilation width
        enum ggml_type dst_type) {
    // ggml stores the innermost dimension in ne[0]:
    // a = [OC, IC, KD, KH, KW] => ne = { KW, KH, KD, IC, OC }
    // b = [N, IC, ID, IH, IW]  => ne = { IW, IH, ID, IC, N }

    // Step 1: perform 1D im2col along the depth dimension
    const int64_t OD    = ggml_calc_conv_output_size(b->ne[2], a->ne[2], s0, p0, d0); // depth
    const int64_t IH    = b->ne[1];
    const int64_t IW    = b->ne[0];
    const int64_t IC_KD = b->ne[3] * a->ne[2]; // IC * KD

    const int64_t ne1[5] = { IC_KD, IW, IH, OD, b->ne[4] }; // intermediate tensor shape: [N, OD, IH, IW, IC * KD]
    struct ggml_tensor * intermediate = ggml_new_tensor(ctx, dst_type, 5, ne1);

    int32_t params_1d[] = { s0, 1, p0, 0, d0, 1, 0 /* is_2D = false */ }; // stride/padding/dilation for the depth dimension
    ggml_set_op_params(intermediate, params_1d, sizeof(params_1d));
    intermediate->op     = GGML_OP_IM2COL; // reuse the existing im2col op (its kernels would need to handle this case)
    intermediate->src[0] = a;
    intermediate->src[1] = b;

    // Step 2: perform 2D im2col on the intermediate tensor for height and width
    const int64_t OH = ggml_calc_conv_output_size(IH, a->ne[1], s1, p1, d1); // height
    const int64_t OW = ggml_calc_conv_output_size(IW, a->ne[0], s2, p2, d2); // width

    const int64_t ne2[5] = { IC_KD * a->ne[1] * a->ne[0], OW, OH, OD, b->ne[4] }; // final output shape: [N, OD, OH, OW, IC * KD * KH * KW]
    struct ggml_tensor * result = ggml_new_tensor(ctx, dst_type, 5, ne2);

    int32_t params_2d[] = { s2, s1, p2, p1, d2, d1, 1 /* is_2D = true */ }; // stride/padding/dilation for height and width
    ggml_set_op_params(result, params_2d, sizeof(params_2d));
    result->op     = GGML_OP_IM2COL;
    result->src[0] = a;            // filter tensor
    result->src[1] = intermediate; // intermediate tensor from step 1
    return result;
}
```
As already stated, GGML_MAX_DIMS would need to be increased to 5 to support these 5D tensors.
These are just starting points and will need testing and optimization. Does this align with your understanding? Are there additional constraints or goals that we should consider?
@stduhpf The conv3d op was just added: https://github.com/ggml-org/llama.cpp/pull/15182, if you’re still interested in continuing the work.