WIP: LTX-Video support
For now, the diffusion model seems to load in memory. The 128-D VAE is still completely unimplemented. The forward logic might be off.
TODO:
- [x] rebase on master properly
- [x] Figure out how to work around `tensor 'model.diffusion_model.proj_out.weight' has wrong shape in model file: got [1, 1, 2048, 128], expected [2048, 128, 1, 1]` (the diffusion model will hopefully load properly after that)
- [ ] VAE support
- [ ] Figure out modulation order
- [ ] Make it generate video
- [ ] Use it to implement Pixart Alpha too
This is great, thank you so much. Can't wait to test it when it's done.
Please support quantized LTX models and conversion too; there are FP8 models on Hugging Face, and FP16 GGUF files are there as well.
Please add support for CPU users too.
It would be great if you could make img2vid work.
Also, in the LTX-Video playground on Hugging Face there are advanced options that allow videos up to 11 seconds, e.g. 512x320 resolution with 257 frames. It would be great if we could make long videos here too.
Conversion/quantization should be working already.
The 5D tensors in the VAE are a pain to deal with. I'm losing motivation...
It's OK, this is a hard task. We have non-working SVD too; it seems video is harder to implement in sd.cpp. Thank you for your hard work. Let me summon the LTX staff themselves: @yoavhacohen, could you give us a hand here please? We seem to be having a hard time bringing video to cpp.
@stduhpf Which operators require more than 4-dimensional tensors? Can these tensors be transformed to fewer dimensions? Maybe it can be done with an appropriate combination of ggml_view and ggml_reshape. We had similar issues when implementing SAM, and I remember we managed to avoid the 5D tensors that were used in the original implementation. Hopefully there is a workaround, since adding support for 5 and more dimensions to ggml would be very difficult.
@ggerganov Basically the whole VAE is made of 3D convolutions, so this means a 3x3x3 kernel for each input/output channel pair. Maybe there is a way to flatten it to use Conv2d instead, but I couldn't figure it out.
Hm, indeed it's not obvious. I guess we will need to increase the GGML_MAX_DIMS at some point.
It’s understandable that this task is challenging, and I appreciate everyone’s efforts so far. Based on the comments, the issue seems to stem from the lack of a conv3d implementation in the GGML library.
Although I’m not familiar with GGML, I noticed that conv2d is implemented using im2col and matmul:
https://github.com/ggerganov/ggml/blob/a5960e80d3e65ce6ff18f90315ab96f63cf9c4cc/src/ggml.c#L3884
The same principle can be extended to conv3d using a 3D version of im2col. Here’s a high-level approach:
Implementing conv3d:
You can create an im2col_3d tensor and perform matrix multiplication for convolution, similar to the conv2d implementation. Below is sample (untested) code:
```c
// a:      [OC, IC, KD, KH, KW]
// b:      [N, IC, ID, IH, IW]
// result: [N, OC, OD, OH, OW]
struct ggml_tensor * ggml_conv_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0,   // stride depth
        int s1,   // stride height
        int s2,   // stride width
        int p0,   // padding depth
        int p1,   // padding height
        int p2,   // padding width
        int d0,   // dilation depth
        int d1,   // dilation height
        int d2) { // dilation width
    // Create the im2col tensor for the 3D input
    struct ggml_tensor * im2col = ggml_im2col_3d(ctx, a, b, s0, s1, s2, p0, p1, p2, d0, d1, d2, a->type); // [N, OD, OH, OW, IC * KD * KH * KW]

    // Perform the convolution as a matrix multiplication
    struct ggml_tensor * result =
        ggml_mul_mat(ctx,
                ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[4] * im2col->ne[3] * im2col->ne[2] * im2col->ne[1]), // [N, OD, OH, OW, IC * KD * KH * KW] => [N*OD*OH*OW, IC * KD * KH * KW]
                ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2] * a->ne[3]), a->ne[4]));                            // [OC, IC, KD, KH, KW] => [OC, IC * KD * KH * KW]

    // Reshape the result back to a 5D tensor
    result = ggml_reshape_5d(ctx, result, im2col->ne[1], im2col->ne[2], im2col->ne[3], im2col->ne[4], a->ne[4]); // [OC, N, OD, OH, OW]
    result = ggml_cont(ctx, ggml_permute(ctx, result, 0, 1, 2, 4, 3)); // swap the N and OC axes: [N, OC, OD, OH, OW]
    return result;
}
```
Implementing im2col_3d:
Since GGML lacks im2col_3d, you can emulate it using a composition of two im2col operations:
- Step 1: Apply 1D im2col along the depth dimension.
- Step 2: Apply 2D im2col over the height and width dimensions.
```c
struct ggml_tensor * ggml_im2col_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b,
        int s0, // stride depth
        int s1, // stride height
        int s2, // stride width
        int p0, // padding depth
        int p1, // padding height
        int p2, // padding width
        int d0, // dilation depth
        int d1, // dilation height
        int d2, // dilation width
        enum ggml_type dst_type) {
    // ggml stores the innermost dimension in ne[0]:
    // a = [OC, IC, KD, KH, KW] => ne = { KW, KH, KD, IC, OC }
    // b = [N, IC, ID, IH, IW]  => ne = { IW, IH, ID, IC, N }

    // Step 1: perform 1D im2col along the depth dimension
    const int64_t OD    = ggml_calc_conv_output_size(b->ne[2], a->ne[2], s0, p0, d0); // depth
    const int64_t IH    = b->ne[1];
    const int64_t IW    = b->ne[0];
    const int64_t IC_KD = b->ne[3] * a->ne[2]; // IC * KD

    const int64_t ne1[5] = { IC_KD, IW, IH, OD, b->ne[4] }; // intermediate tensor shape: [N, OD, IH, IW, IC * KD]
    struct ggml_tensor * intermediate = ggml_new_tensor(ctx, dst_type, 5, ne1);

    int32_t params_1d[] = { s0, 1, p0, 0, d0, 1, 0 /* is_2D = false */ }; // stride/padding/dilation for the depth dimension
    ggml_set_op_params(intermediate, params_1d, sizeof(params_1d));
    intermediate->op     = GGML_OP_IM2COL; // reuse the existing im2col op (its kernels would need to handle this case)
    intermediate->src[0] = a;
    intermediate->src[1] = b;

    // Step 2: perform 2D im2col on the intermediate tensor for height and width
    const int64_t OH = ggml_calc_conv_output_size(IH, a->ne[1], s1, p1, d1); // height
    const int64_t OW = ggml_calc_conv_output_size(IW, a->ne[0], s2, p2, d2); // width

    const int64_t ne2[5] = { IC_KD * a->ne[1] * a->ne[0], OW, OH, OD, b->ne[4] }; // final output shape: [N, OD, OH, OW, IC * KD * KH * KW]
    struct ggml_tensor * result = ggml_new_tensor(ctx, dst_type, 5, ne2);

    int32_t params_2d[] = { s2, s1, p2, p1, d2, d1, 1 /* is_2D = true */ }; // stride/padding/dilation for height and width
    ggml_set_op_params(result, params_2d, sizeof(params_2d));
    result->op     = GGML_OP_IM2COL;
    result->src[0] = a;            // filter tensor
    result->src[1] = intermediate; // intermediate tensor from step 1
    return result;
}
```
As already stated, GGML_MAX_DIMS would need to be increased to 5 to support these 5D tensors.
These are just starting points and will need testing and optimization. Does this align with your understanding? Are there additional constraints or goals that we should consider?
@stduhpf The conv3d op was just added: https://github.com/ggml-org/llama.cpp/pull/15182, if you’re still interested in continuing the work.