Libass Integration
Hi,
I've made some changes to FFmpegInteropX to support libass as a subtitle renderer. The implementation is largely inspired by the libass integration for JavaScript, which you can find here:
https://github.com/libass/JavascriptSubtitlesOctopus
By default, libass operates as follows:
- Initialize the library using ass_library_init.
- Initialize the renderer using ass_renderer_init.
- Create a subtitle track using ass_read_memory (other methods exist, but we're constrained by UWP).
- Load the subtitle header using ass_process_codec_private.
- Add subtitle chunks from FFmpeg using ass_process_chunk.

The issue I'm encountering is with creating IMediaCue.
Libass uses ass_render_frame to generate an ASS_Image, which works well for rendering. However, since this process must happen in real time, I'm unsure whether it's feasible to create IMediaCue instances based on the current implementation. Is it possible to display subtitles accurately using the media duration?
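A rough sketch of that setup sequence for an ASS stream demuxed by FFmpeg (placeholder names like scriptHeader, headerSize, pkt, ptsMs and durationMs; error handling omitted):

#include <ass/ass.h>

// Library and renderer setup (the render size here is just a placeholder).
ASS_Library* assLibrary = ass_library_init();
ASS_Renderer* assRenderer = ass_renderer_init(assLibrary);
ass_set_frame_size(assRenderer, 1920, 1080);

// ass_read_memory parses a complete in-memory script; for a demuxed stream,
// an empty track is created and fed the codec private data (the ASS header).
ASS_Track* assTrack = ass_new_track(assLibrary);
ass_process_codec_private(assTrack, (char*)scriptHeader, headerSize);

// Each subtitle AVPacket from FFmpeg becomes one chunk (times in milliseconds).
ass_process_chunk(assTrack, (char*)pkt->data, pkt->size, ptsMs, durationMs);

// Rendering then happens per timestamp.
int changed = 0;
ASS_Image* image = ass_render_frame(assRenderer, assTrack, ptsMs, &changed);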
P.S. I’ve noticed a recent issue compiling FFmpegInteropX with the target platform set to 10.0.22000.0. To resolve it, I switched to 10.0.26100.0.
Thanks. #439
I just tested my C# sample from https://github.com/ffmpeginteropx/FFmpegInteropX/issues/439#issuecomment-2561746413 , and instead of using a timer to update the UI, I switched to mediaPlayer.PlaybackSession.PositionChanged for updates.
However, I noticed that PositionChanged is significantly slower compared to using a timer to repeatedly query mediaPlayer.PlaybackSession.Position.
Thanks for sharing this. It is a good starting point for the integration.
This is almost complete. We have to call Blend in the CreateCue method, and then create a bitmap from that blend result. Nothing too fancy there.
The problem you are having, which is the same problem I was having, is that ass_render_frame does not work: it returns NULL, and therefore nothing is blended for any of my test files. This happens despite ass_read_memory and ass_process_chunk being called and the track correctly having events inside. My approach was slightly different, using an ass_track for each cue, which is safer in seek and flush situations, but this can be refactored later.
Happy New Year.
This issue occurs when you don't call ass_set_fonts after initializing ASS_Renderer. I realized I had forgotten to include this step.
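For reference, the missing call, placed right after ass_renderer_init, looks roughly like this (default values taken from common libass examples):

// Tell libass which fonts to use; without this, ass_render_frame returns NULL.
ass_set_fonts(assRenderer,
    nullptr,                      // no explicit default font file
    "sans-serif",                 // fallback font family
    ASS_FONTPROVIDER_AUTODETECT,  // let libass pick a platform font provider
    nullptr,                      // no fontconfig configuration file
    1);                           // update the font cache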
Regarding your point, I'm not entirely sure you're correct. Calling ass_render_frame inside CreateCue doesn't seem appropriate (at least, I don't think so). The ass_render_frame function should only be called when a frame changes, and I believe CreateCue doesn't handle this scenario.
I made some adjustments, and while the changes work to some extent, the SoftwareBitmap isn't being displayed as expected.
Here's what I tested:
I used the MediaPlayerCS sample, added an Image control to the UI, and set up a CueEntered event as follows:
(This actually worked)
private async void OnTimedTrackCueEntered(TimedMetadataTrack sender, MediaCueEventArgs args)
{
if (args.Cue is ImageCue cue)
{
await Dispatcher.RunAsync(CoreDispatcherPriority.Normal, async () =>
{
var sub = cue.SoftwareBitmap;
var bitmapSource = new SoftwareBitmapSource();
await bitmapSource.SetBitmapAsync(sub);
image.Source = bitmapSource;
Debug.WriteLine($"{cue.StartTime} | {cue.Duration} | {sub.PixelWidth}x{sub.PixelHeight}");
});
}
}
<Image x:Name="image" Grid.Row="1" Width="400" Height="300" />
- Update Issue: The image doesn't update consistently, but it's a start.
- Pixel Calculation Error: There's an issue with pixel calculations somewhere in the code.
libass requires ass_render_frame to be called for every frame. So, how should the ImageCue handle StartTime and Duration in this context?
Screen captures for comparison: MediaPlayerCS vs. PotPlayer.
Using ImageCues for ASS rendering is not suitable; the two just don't go together. ImageCue timed tracks are for static bitmap subtitles like dvdsub, dvbsub or HDMV/PGS.
Even though the definition of ASS events involves a start time and a duration, an ASS event doesn't necessarily stand for a bitmap which remains static (unchanged) over the duration of an event - but that's what ImageCue bitmaps are designed for.
Also, there's another mismatch: ASS events can overlap in time, so there can be multiple active at the same time. You are creating an ImageCue for each ASS event - which would still be fine in the case of static bitmaps and without libass. But libass doesn't render any output that is related to a specific ASS event, so in turn it also can't render anything that is related to a specific image cue.
Even further, an ImageCue is supposed to have a fixed position and size, but libass doesn't give you anything like that. Both can change from frame to frame.
The overall conclusion is simply that TimedMetadataTrack and ImageCue aren't suitable APIs for the way libass renders its output: frame-wise, not per ASS event.
You need to call ass_render_frame() once for each video frame being shown (or for every second one, etc.), and when detect_change is 1, you need to bring that output on screen in some way.
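A minimal sketch of that loop, assuming renderer and track have already been set up (BlendAndPresent is a hypothetical helper):

// Call once per displayed video frame, with the current playback time in ms.
void RenderSubtitlesAt(ASS_Renderer* renderer, ASS_Track* track, long long timeMs)
{
    int detectChange = 0;
    ASS_Image* image = ass_render_frame(renderer, track, timeMs, &detectChange);

    if (detectChange != 0)
    {
        // The output differs from the previous call: blend the ASS_Image list into
        // a bitmap/surface and present it. When detectChange is 0, the previously
        // presented image can simply be reused.
        BlendAndPresent(image);
    }
}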
In case you want to render ASS subs statically, without animation:
That would of course be possible with ImageCues - but the problem here is that the FFmpegInteropX SubtitleProvider base implementation is not suitable for this, since it assumes that each AVPacket creates one MediaCue, and that doesn't work out in this case.
It would work like this:
- Feed all ASS events (AVPacket) into libass (this happens on the start of playback, because ASS subs are not interleaved in the stream)
- While doing so, for each ASS event
- Modify the event to strip all animations (I have code for that)
- Put start and end time of each event into a list of time stamps (1 dimensional, without distinction of start and end)
- Finally, de-duplicate and sort that list
- Now, iterate through that list and for each timestamp
- ass_render an image
- use graphical algorithms to detect regions with content
- create an image cue for each region
- all having the same start and duration (from the current timestamp to the next one in the list)
Finally, there's one problem to solve: You don't want to create all the images on playback start, so you need to synchronize in some way with the playback position and make sure you only create image cues for e.g. the next 30s.
This will give you static, non-animated rendering of ASS subs - and that can be done using ImageCue. You also don't need to care about the rendering but can let Windows Media do it (so no listening to cue entered events).
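A rough sketch of the timestamp-segmentation idea described above, assuming the animations have already been stripped from the events (CreateImageCuesFromRegions is a hypothetical helper that runs the region detection and builds one ImageCue per region):

#include <algorithm>
#include <vector>

// Collect every start and end time (in ms) of every event, 1-dimensional.
std::vector<long long> timestamps;
for (int i = 0; i < assTrack->n_events; i++)
{
    timestamps.push_back(assTrack->events[i].Start);
    timestamps.push_back(assTrack->events[i].Start + assTrack->events[i].Duration);
}
std::sort(timestamps.begin(), timestamps.end());
timestamps.erase(std::unique(timestamps.begin(), timestamps.end()), timestamps.end());

// Render once per segment; all cues in a segment share the same start/duration.
for (size_t i = 0; i + 1 < timestamps.size(); i++)
{
    long long start = timestamps[i];
    long long duration = timestamps[i + 1] - start;

    int changed = 0;
    ASS_Image* image = ass_render_frame(assRenderer, assTrack, start, &changed);
    if (image == nullptr)
        continue;   // nothing on screen during this segment

    CreateImageCuesFromRegions(image, start, duration);
}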
PS: Happy New Year as well!
Happy new year everyone! Excellent work. This is an important milestone. We can modify the SubtitleProvider to return a list of ImageCues, one for each individual frame in the animation.
I’ve added a new function to the SubtitleProvider class, which creates a new collection of IMediaCue objects (IVector<IMediaCue>). In this function, I populate a list of cues based on position and duration. For this implementation, I used a loop with a duration of 500 milliseconds for each cue.
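Roughly, the slicing looks like this (a sketch with hypothetical names; RenderFrameAt wraps ass_render_frame plus the blending):

#include <algorithm>
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.Foundation.Collections.h>
#include <winrt/Windows.Graphics.Imaging.h>
#include <winrt/Windows.Media.Core.h>

using namespace winrt;
using namespace winrt::Windows::Foundation;
using namespace winrt::Windows::Foundation::Collections;
using namespace winrt::Windows::Media::Core;

IVector<IMediaCue> CreateCuesForEvent(TimeSpan position, TimeSpan duration)
{
    auto cues = single_threaded_vector<IMediaCue>();
    const TimeSpan step{ 5000000 };   // 500 ms in 100 ns ticks

    for (TimeSpan offset{ 0 }; offset < duration; offset += step)
    {
        ImageCue cue;
        cue.StartTime(position + offset);
        cue.Duration(std::min(step, duration - offset));
        cue.SoftwareBitmap(RenderFrameAt(position + offset));   // hypothetical render helper
        cues.Append(cue);
    }
    return cues;
}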
Despite this, subtitles still don’t display unless they’re manually added in C#.
Additionally, I implemented a new function in FFmpegMediaSource to capture the current frame from libass directly. Here is the result of that implementation:
https://1drv.ms/v/c/6ad13c09a43a4b36/Ef1Xvke1IG9MutjWh7NkQUkBP_BewvVVJwhKaahjI9nmNg?e=fIAgZ0
However, there's an issue with blending the colors: the color calculation is incorrect. None of the displayed colors match the intended ones. For reference, the correct colors should look like this:
https://1drv.ms/v/s!AjZLOqQJPNFqf5OHo1X5i1OD_WA?e=bm7fle
Hmm, I was under the impression that the blending algorithm came from the JavaScript implementation? I haven't looked at it much, although IIRC something caught my eye at some point that seemed incorrect.
In any case, colours aren't so important. We need to think through the animation side.
Assuming the animation fps is the same as the video fps (the libass API seems to point in that direction), we can use the sample request events to drain the subtitle provider of animation frames. It would work similarly to avcodec or the filter graph, so some more extensive refactoring might be necessary here.
I see no reason ImageCue cannot handle animations, assuming 1 cue = 1 animation frame. Other than maybe potential performance problems in MPE.
I see no reason ImageCue cannot handle animations
I do. One full-size PNG image for each video frame? Seriously?
I found out why the cue doesn't appear in the UI: the (x, y) cuePosition was set to 100. I changed it to 0, and the subtitle displayed correctly.
The ConvertASSImageToSoftwareBitmap function was created with the help of ChatGPT, so I'm not sure where it came from. However, I gave ChatGPT multiple reference implementations from different projects, and none of the results seem to work correctly.
I also tried animations again, but there are many dropped cues, and most of them don't show up. However, as you can see, it works fine when you render it yourself with a timer—it's fast and works (except for the color part, of course).
As @softworkz mentioned, I think the ImageCue is not meant to be used for animation effects.
A side thought: Is it possible that our data (styles and colors) are incorrect when appending it to libass?
A side thought: Is it possible that our data (styles and colors) are incorrect when appending it to libass?
You can easily find out by not doing it. For actual ASS subtitles, this shouldn't be done anyway.
As @softworkz mentioned, I think the ImageCue is not meant to be used for animation effects.
Of course not 😆
I do. One full-size PNG image for each video frame? Seriously?
It doesn't have to be full size though, does it?
The ImageCue represents exactly 1 frame in the animation. We should strive to achieve that before we try to do custom rendering. I'll take a shot at refactoring the SubtitleProvider to support multiple cues.
I do. One full-size PNG image for each video frame? Seriously?
It doesn't have to be full size though, does it?
- You don't know it up-front
- If there's small text at the top left and at the bottom right, the image would still be almost full-screen (unnecessarily)
The ImageCue represents exactly 1 frame in the animation.
It's not made for this.
We should strive to achieve that
As far as I understood @ramtinak, he "strived" already and it didn't work out (which is what I had expected)
The integration is nowhere near "production ready". We have some PoCs that are extremely important for understanding how this works, but that's about it.
Given past integrations, expect this to be ready some time in June after 100+ commits. This is still a brand new PR 😆
The integration is nowhere near "production ready".
Nobody said that. It's about determining viable ways.
Given past integrations, expect this to be ready some time in June
Or even later when pursuing ways that aren't viable 😆
The ConvertASSImageToSoftwareBitmap function was created with the help of ChatGPT, so I'm not sure where it came from. However, I gave ChatGPT multiple reference implementations from different projects, and none of the results seem to work correctly.
It's good for trivial things (which are commonly known but tedious to look up), for getting a starting point (e.g. boilerplate code), or for applying repetitive transformations (after you have established/"taught" how it should do them) - but if there's one thing that's reliable about ChatGPT, it's that it never gets things right on the first shot when they're a little bit more complicated.
Also, I wouldn't feed it multiple examples using different approaches. It's not clever enough to separate and distinguish them. The ffmpeg implementation is fine but slow. Did you take a look at MPV player, as was suggested by others in the libass conversation?
PS: Even when you still want to try with ChatGPT - at some point it's inevitable to UUOB!
(use your own brain)
The problem is time syncing the animation from our end, because we have no real time sync
Or even later when pursuing ways that aren't viable 😆
You are possibly right. Libass does look like it sits better in the rendering loop than in the decoding loop. There are several limitations in the demux+decode loop we have, given how the API works.
- We have no way of knowing the actual size of the video being rendered. In 99.99% of cases, this will be different than the native video size
- Time sync is difficult - we get our sub chunks before they are actually rendered, libass seems to really like being in sync with the video position.
- The so-called ffmpeg-libass integration seems nonexistent. Aside from the ass and sub filters, which are useless in the decoding loop, there's literally nothing else. We may as well not even enable libass in the ffmpeg build script.
I personally don't see how we can get away without frame server mode - which is the only way we can have time sync, per frame sync with libass.
PotPlayer and the rest that use libass are players - we currently are not players. But it seems the only reliable way to do it is to be players.
2. Time sync is difficult - we get our sub chunks before they are actually rendered, libass seems to really like being in sync with the video position.
If you keep ingesting packets from the decoder, then you can have processed ass subs for the whole video even before it starts playing. That's because they are stored without being interleaved with the video and audio stream. You get them all at once - just like it was an external ass file.
libass seems to really like being in sync with the video position.
Not really. You give it an arbitrary timestamp (in no order) and it renders the subtitle image for that time (plus it tells you whether the output is different from the previous render call you made).
3. The so-called ffmpeg-libass integration seems nonexistent. Aside from the ass and sub filters, which are useless in the decoding loop, there's literally nothing else. We may as well not even enable libass in the ffmpeg build script.
Yes that's not needed if you compile it otherwise. The only reason to build it as part of ffmpeg would be when it's easier to compile within that context instead of compiling it separately.
PotPlayer and the rest that use libass are players - we currently are not players. But it seems the only reliable way to do it is to be players.
I know what you mean - but it's also kind of a paradox: how can FFmpegInteropX play video without being a player, but cannot play animated subtitles without being a player? 😆
I found out why the cue doesn't appear in the UI: the (x, y) cuePosition was set to 100. I changed it to 0, and the subtitle displayed correctly.
Yes, I found that out yesterday evening as well while testing out the branch, but I did not have time to clean up and commit. Plus, there is a bug with the video resolution: libass is initialized with FullHD, but if the actual video has a different size, it will push the new size in NotifyVideoFrameSize. If they are different, the resulting image is either broken, or the rendering crashes due to out-of-bounds access. We should not mix up the subtitle render size and the native video size; these are two separate things. Also, we should delay-initialize libass, because it is recommended to also tell libass the native video size (which is only available when the decode loop starts).
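To illustrate the distinction (values are placeholders):

// The subtitle render size is whatever resolution we blend and display at;
// the storage size is the native video resolution, which libass uses for
// scaling decisions and which only becomes known once the decode loop starts.
ass_set_frame_size(assRenderer, renderWidth, renderHeight);
ass_set_storage_size(assRenderer, videoWidth, videoHeight);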
The ConvertASSImageToSoftwareBitmap function was created with the help of ChatGPT, so I'm not sure where it came from. However, I gave ChatGPT multiple reference implementations from different projects, and none of the results seem to work correctly.
The implementation is horribly inefficient. Using a vector here means a bounds check on every single pixel access (actually four times per pixel, once for each color channel). Then the whole thing is copied over to an IBuffer, and finally SoftwareBitmap copies everything over once again. Thank you ChatGPT, well done ^^
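One possible way around the extra copies, as a sketch: allocate the SoftwareBitmap up front and blend straight into its locked buffer (standard C++/WinRT pattern, not tied to our current code):

#include <winrt/Windows.Graphics.Imaging.h>
#include <MemoryBuffer.h>   // Windows::Foundation::IMemoryBufferByteAccess

using namespace winrt;
using namespace winrt::Windows::Graphics::Imaging;

SoftwareBitmap bitmap(BitmapPixelFormat::Bgra8, width, height, BitmapAlphaMode::Premultiplied);
{
    auto bitmapBuffer = bitmap.LockBuffer(BitmapBufferAccessMode::Write);
    auto reference = bitmapBuffer.CreateReference();
    auto byteAccess = reference.as<::Windows::Foundation::IMemoryBufferByteAccess>();

    BYTE* data = nullptr;
    UINT32 capacity = 0;
    check_hresult(byteAccess->GetBuffer(&data, &capacity));

    auto plane = bitmapBuffer.GetPlaneDescription(0);
    // Blend the ASS_Image list directly into data here, using plane.Stride
    // for row addressing; no intermediate vector or IBuffer copy needed.
}
// The buffer is unlocked when bitmapBuffer and reference go out of scope.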
I also tried animations again, but there are many dropped cues, and most of them don't show up. However, as you can see, it works fine when you render it yourself with a timer—it's fast and works (except for the color part, of course).
That's what I expected. It would work for static subs, but not for animations. We could consider offering both: Using libass with IMediaCue to improve static rendering of subtitles (without needing custom renderer integration into the client app), and a full blown solution with custom renderer and animations enabled.
No matter how we do it, we must provide a way for the client app to send the desired subtitle render size to our lib.
I know what you mean - but it's also kind of a paradox: how can FFmpegInteropX play video without being a player, but cannot play animated subtitles without being a player? 😆
Technically FFmpegInteropX doesn't play video.
If you keep ingesting packets from the decoder, then you can have processed ass subs for the whole video even before it starts playing. That's because they are stored without being interleaved with the video and audio stream. You get them all at once - just like it was an external ass file.
Does this actually happen all the time? If so, this complicates an IMediaCue-based implementation (static or otherwise) considerably, as we need to have all images either rendered in time or deferred somehow. It is weird, because ffmpeg also automatically converts srt to ass (kinda funny considering ffmpeg does not seem capable of rendering the ass by itself outside of ffplay).
Not really. You give it an arbitrary timestamp (in no order) and it renders the subtitle image for that time (plus it tells you whether the output is different from the previous render call you made).
You can, but I see no reason why you should. Technically we can give the user the ability to set animation FPS that's different than the video FPS. But even so, this is not always productive, as you may end up with frame tearing. As I said, the libass API seems designed to be rendered at the same FPS as the video FPS, and the whole ass_image seems to be very well suited for being burned directly onto the video frame.
That's what I expected. It would work for static subs, but not for animations. We could consider offering both: Using libass with IMediaCue to improve static rendering of subtitles (without needing custom renderer integration into the client app), and a full blown solution with custom renderer and animations enabled.
I think the reason cues are dropped is that the images themselves are computed in the CueEntered event. Since the blending from ChatGPT is very inefficient, the cue has already exited by the time the rendering is done.
I am not sure offering static cues via libass with IMediaCue should be a priority. The current parser we have is good enough, and most of the problems come from MPE not supporting the whole TimedTextCue API. Once we have a custom libass renderer, we can use that one to satisfy both static and animated rendering. Having two different APIs for libass might be confusing for users, and it will be harder for us to maintain long term.
Thank you ChatGPT, well done ^^
Another funny part is that if you ask ChatGPT about UWP and FFmpeg, you will get back some code that looks suspiciously similar to our library ^^ We are the kind of library ChatGPT learns from lol.
@lukasf For full-blown rendering, all we have to do is ask the user for an IDirect3DSurface to render onto -> this will give us all the info we need: size, pixel format, etc., and IDirect3DSurface can be used in a variety of APIs, from SwapChains to Image controls. It is also shared between UWP and WinUI 3.
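A sketch of what that entry point could look like (SetSubtitleRenderTarget and assRenderer are hypothetical; only the size/format query is the point here):

#include <winrt/Windows.Graphics.DirectX.Direct3D11.h>

using winrt::Windows::Graphics::DirectX::Direct3D11::IDirect3DSurface;

// The client hands us the surface it wants subtitles drawn on.
void SetSubtitleRenderTarget(IDirect3DSurface const& surface)
{
    auto desc = surface.Description();
    // desc.Width / desc.Height give the real render size to pass to libass,
    // desc.Format tells us which pixel format to blend into.
    ass_set_frame_size(assRenderer, desc.Width, desc.Height);
}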
Actually, the ConvertASSImageToSoftwareBitmap function is not only inefficient, it is completely wrong. Every ass_image overwrites the pixels from the previous image instead of blending them in. That's why the colors are all wrong - last color wins ^^ Plus, it just sets the RGB color values straight and accumulates the alpha. But the image alpha format is pre-multiplied, so each color channel must be pre-multiplied with the alpha value as well.
The code from Octopus uses a large float[] buffer as an intermediate image, which will lead to slightly better blend quality at the cost of 4x more memory consumption plus a second "un-multiply" phase. Not sure if that is a good idea. I'd rather stick with ARGB32 for now.
I have pushed the fixes to get basic subtitle rendering working. And I will try to fix and speed up the broken blend function.
Oops, I may have forced you to do a conflict merge :(
BTW, I noticed shared_ptr can be used with a custom deleter. We can use it like so:
shared_ptr<ASS_Library>(assLibrary, ass_library_done);
This should simplify our memory cleanup logic.
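For instance, all three libass handles could be wrapped that way (the renderer and track must still be released before the library):

#include <memory>

std::shared_ptr<ASS_Library> library(ass_library_init(), ass_library_done);
std::shared_ptr<ASS_Renderer> renderer(ass_renderer_init(library.get()), ass_renderer_done);
std::shared_ptr<ASS_Track> track(ass_new_track(library.get()), ass_free_track);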
I changed ConvertASSImageToSoftwareBitmap to this one:
SoftwareBitmap ConvertASSImageToSoftwareBitmap(ASS_Image* assImage, int width, int height)
{
    if (width <= 0)
        width = 1920;
    if (height <= 0)
        height = 1080;
    if (!assImage) {
        throw std::invalid_argument("ASS_Image is null");
    }

    size_t pixelDataSize = width * height * 4;
    std::vector<uint8_t> pixelData(pixelDataSize, 0);

    // Blend every image in the linked list into the target buffer
    // ("source over destination", premultiplied alpha, Bgra8 layout).
    for (ASS_Image* img = assImage; img != nullptr; img = img->next)
    {
        uint8_t* src = img->bitmap;
        int stride = img->stride;

        // ASS_Image::color is packed as 0xRRGGBBAA, where AA is the
        // *transparency* of the color (0 = fully opaque).
        uint32_t color = img->color;
        uint8_t r = (color >> 24) & 0xFF;
        uint8_t g = (color >> 16) & 0xFF;
        uint8_t b = (color >> 8) & 0xFF;
        uint8_t opacity = 255 - (color & 0xFF);

        for (int y = 0; y < img->h; ++y)
        {
            int destY = img->dst_y + y;
            if (destY < 0 || destY >= height) continue;   // clip to target bitmap

            for (int x = 0; x < img->w; ++x)
            {
                int destX = img->dst_x + x;
                if (destX < 0 || destX >= width) continue;

                // Effective source alpha = glyph coverage * color opacity.
                uint8_t coverage = src[y * stride + x];
                if (coverage == 0) continue;
                float sa = (coverage * opacity) / (255.0f * 255.0f);

                int destIndex = (destY * width + destX) * 4;

                // Premultiplied "over" blend: dst = src * sa + dst * (1 - sa)
                pixelData[destIndex + 0] = static_cast<uint8_t>(b * sa + pixelData[destIndex + 0] * (1.0f - sa));
                pixelData[destIndex + 1] = static_cast<uint8_t>(g * sa + pixelData[destIndex + 1] * (1.0f - sa));
                pixelData[destIndex + 2] = static_cast<uint8_t>(r * sa + pixelData[destIndex + 2] * (1.0f - sa));
                pixelData[destIndex + 3] = static_cast<uint8_t>(255.0f * sa + pixelData[destIndex + 3] * (1.0f - sa));
            }
        }
    }

    auto buffer = winrt::Windows::Storage::Streams::Buffer(static_cast<uint32_t>(pixelDataSize));
    memcpy(buffer.data(), pixelData.data(), pixelDataSize);
    buffer.Length(static_cast<uint32_t>(pixelDataSize));

    BitmapPixelFormat pixelFormat = BitmapPixelFormat::Bgra8;
    BitmapAlphaMode alphaMode = BitmapAlphaMode::Premultiplied;
    SoftwareBitmap bitmap = SoftwareBitmap::CreateCopyFromBuffer(buffer, pixelFormat, width, height, alphaMode);
    return bitmap;
}
The subtitle quality has improved significantly, though there are still some issues with the colors.
Also, it seems that attached fonts and images are not rendered at all.
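If it helps, embedded fonts could be fed to libass from the FFmpeg attachment streams, roughly like this (a sketch; formatContext and assLibrary are placeholders, and ass_set_fonts needs to be called afterwards so the new fonts are picked up):

extern "C" {
#include <libavformat/avformat.h>
}

for (unsigned int i = 0; i < formatContext->nb_streams; i++)
{
    AVStream* stream = formatContext->streams[i];
    if (stream->codecpar->codec_type != AVMEDIA_TYPE_ATTACHMENT)
        continue;

    // The font file itself is stored in the attachment's extradata.
    auto filename = av_dict_get(stream->metadata, "filename", nullptr, 0);
    ass_add_font(assLibrary,
        filename ? filename->value : "embedded_font",
        (char*)stream->codecpar->extradata,
        stream->codecpar->extradata_size);
}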