Add the ability to output Karaoke subtitle as .ass

Open moisestohias opened this issue 2 years ago • 5 comments

The requested feature can be found here. Currently, to generate karaoke subtitles in .ass format, I have to perform two separate passes. The first pass obtains word-by-word timestamps, and the second pass identifies sentence boundaries. Once I have the sentence boundaries, I use my own script to insert the timestamps (\k tags) into the subtitle file.

Unfortunately, stable-ts is unable to provide the natural boundary that I require.
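
The \k insertion step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script mentioned in the post: the `Word` tuple, `ass_time`, and `karaoke_line` are hypothetical names, and the input is assumed to be one sentence's words with start and end times in seconds.

```python
from typing import List, NamedTuple

class Word(NamedTuple):   # hypothetical container, not part of whisper.cpp
    text: str             # word as it should appear on screen
    start: float          # start time in seconds
    end: float            # end time in seconds

def ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc, the timestamp form used in ASS events."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

def karaoke_line(words: List[Word]) -> str:
    """Build one Dialogue event for a sentence, with one karaoke tag per word."""
    parts = []
    cursor = words[0].start
    for w in words:
        # \k durations are in centiseconds; measuring from the end of the
        # previous word folds any pause into the current word's duration.
        dur = int(round((w.end - cursor) * 100))
        parts.append("{\\k%d}%s" % (dur, w.text))
        cursor = w.end
    text = " ".join(parts)
    start, end = ass_time(words[0].start), ass_time(words[-1].end)
    return f"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}"

sentence = [Word("Hello", 1.0, 1.5), Word("world", 1.5, 2.25)]
print(karaoke_line(sentence))
```

Each sentence found in the second pass would become one such Dialogue line, with the per-word durations taken from the first pass.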

Update

After a bit of research, I realized that getting an accurate timestamp for each word is impossible with Whisper, since it wasn't trained to output word-level timestamps. There are other projects, notably whisperX (its authors published a paper detailing their work), that build on and improve Whisper to produce word-by-word karaoke-style output.

BR

moisestohias avatar May 06 '23 05:05 moisestohias

Another vote from me for .ass format karaoke output.

zkvsky avatar Feb 07 '24 06:02 zkvsky

I've made a Python script to convert the json-full output into .ass subtitles with word-by-word timestamps. It still needs some work, but it is somewhat usable already:

./main -m $MODEL_FILE -f input.wav -ojf 
whispercppjf2ass.py input.wav.json input.ass

eadmaster avatar Feb 15 '24 12:02 eadmaster

Works nicely. I needed to modify it slightly, as it wasn't showing all characters, and I also improved the colors and the positioning. Before this, I was using the -owts (--output-words) option, which outputs a script for generating a karaoke video, and then combining that video with the original mp4 by positioning it below and extending the video canvas. That also works nicely, and it is all done with ffmpeg.

matijagrcic avatar Mar 11 '24 17:03 matijagrcic

@eadmaster, I've modified your script to keep the original spaces and rich punctuation in outputs like the following one:

[00:00:39.520 --> 00:00:44.720]   - And today, I've got kind of a personal story for you.
[00:00:44.720 --> 00:00:45.920]   - Okay. [laughs]
[00:00:45.920 --> 00:00:51.040]   - But okay, so the baseline here is that I'm like, thankfully...
[00:00:51.040 --> 00:00:55.040]   I'm a healthy person. Like, I have no pre-existing conditions.
[00:00:55.040 --> 00:00:57.360]   I don't smoke. I don't drink. Nothing like that.
[00:00:57.360 --> 00:01:00.720]   I am and have been for, like, the vast majority of my life,
[00:01:00.720 --> 00:01:03.280]   like, very lucky that way. - Yeah.
[00:01:03.280 --> 00:01:09.840]   - And then, about 10 months ago, I started noticing blood in my poop.
[00:01:09.840 --> 00:01:12.480]   - Okay. - Like, it wasn't painful
[00:01:12.480 --> 00:01:14.000]   or anything. It didn't feel any different.
[00:01:14.000 --> 00:01:18.720]   It was just like a shocking red alarm, red alert.
[00:01:18.720 --> 00:01:21.120]   This is blood. - And was it just the once?
[00:01:21.120 --> 00:01:23.040]   - No, no, no. It was multiple times. It was every day.
[00:01:23.040 --> 00:01:25.440]   Yeah. And I was like, "This is weird."

Modified script:

#!/usr/bin/env python3
import json, sys, argparse

def read_input(infile):
    if infile == "-": return sys.stdin.read()
    with open(infile, "r") as f: return f.read()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', nargs='?', default="-", help="input file, default to stdin if unspecified.")
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout, help="output file, default to stdout if unspecified")
    args = parser.parse_args()

    parsed_json = json.loads(read_input(args.infile))

    args.outfile.write('''[Script Info]
; Converted using whisper.cpp
Title:
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
Collisions: Normal

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,24,&H00FFFFFF,&H000088EF,&H00000000,&H00666666,-1,0,0,0,100,100,0,0,1,1.5,0,8,0,0,20,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
''')

    for line in parsed_json.get("transcription", []):
        text = "".join(token["text"] for token in line.get("tokens", []) if token["text"])
        if text == "â™": continue
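        # "00:00:39,520" -> "0:00:39.52": ASS events use H:MM:SS.cc (centiseconds)
        from_ts = line["timestamps"]["from"][1:-1].replace(",", ".")
        to_ts = line["timestamps"]["to"][1:-1].replace(",", ".")
        # Build the karaoke text outside the f-string so the script also runs on Python < 3.12
        karaoke = "".join(
            # \k takes a duration in centiseconds; token offsets are in milliseconds
            "{{\\k{}}}{}".format(int(abs(t["offsets"]["to"] - t["offsets"]["from"]) / 10), t["text"])
            for t in line.get("tokens", [])
            if t["text"] and ord(t["text"][0]) <= 127 and t["text"] not in ["ª", "â"] and not t["text"].startswith(("[_TT", "[_BE", "♪"))
        )
        args.outfile.write(f"Dialogue: 1,{from_ts},{to_ts},Default,,0,0,0,fx,{karaoke}\n")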

if __name__ == "__main__":
    main()
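
To make the two conversions in the script easier to follow, here is a small worked example on a hand-written input. The sample JSON is an assumption modeled only on the fields the script reads (`timestamps`, `tokens`, `offsets`, `text`), not a real whisper.cpp dump, and `to_ass_time` and `k_tag` are hypothetical helper names.

```python
import json

# Hand-written sample containing only the fields the script reads (assumption).
sample = json.loads('''
{"transcription": [{
  "timestamps": {"from": "00:00:39,520", "to": "00:00:40,400"},
  "tokens": [
    {"text": "[_BEG_]", "offsets": {"from": 39520, "to": 39520}},
    {"text": " And",    "offsets": {"from": 39520, "to": 39800}},
    {"text": " today",  "offsets": {"from": 39800, "to": 40400}}
  ]}]}
''')

def to_ass_time(ts):
    # Drop the first hour digit and the last millisecond digit, then swap
    # the decimal comma for a dot: "00:00:39,520" -> "0:00:39.52"
    return ts[1:-1].replace(",", ".")

def k_tag(token):
    # Token offsets are in milliseconds; \k expects centiseconds.
    cs = int(abs(token["offsets"]["to"] - token["offsets"]["from"]) / 10)
    return "{\\k%d}%s" % (cs, token["text"])

line = sample["transcription"][0]
body = "".join(
    k_tag(t) for t in line["tokens"]
    if not t["text"].startswith(("[_TT", "[_BE"))  # skip special tokens
)
event = "Dialogue: 1,%s,%s,Default,,0,0,0,fx,%s" % (
    to_ass_time(line["timestamps"]["from"]),
    to_ass_time(line["timestamps"]["to"]),
    body,
)
print(event)
# Dialogue: 1,0:00:39.52,0:00:40.40,Default,,0,0,0,fx,{\k28} And{\k60} today
```

The leading space of each token is kept as-is, which is how the original spacing survives in the output.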

zkvsky avatar Apr 08 '24 01:04 zkvsky

if text == "â�": continue

What's the encoding of the code?

Error Log

$ python json2ass.py a.wav.json a.wav.ass
Traceback (most recent call last):
  File "/path/to/whisper.cpp/sample-data/json2ass.py", line 44, in <module>
    main()
  File "/path/to/whisper.cpp/sample-data/json2ass.py", line 14, in main
    parsed_json = json.loads(read_input(args.infile))
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/whisper.cpp/sample-data/json2ass.py", line 6, in read_input
    with open(infile, "r") as f: return f.read()
                                        ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 2980-2981: invalid continuation byte

BTW, the JSON file contains Chinese characters.
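
A possible workaround, offered as a guess rather than a diagnosis: the two-byte range in the error would be consistent with a multi-byte character that was cut off, for example if a token's text holds only part of a character. Under that assumption, the file can be read as bytes and decoded leniently. `load_whisper_json` is a hypothetical helper, not part of whisper.cpp or the script above.

```python
import json

def load_whisper_json(path):
    """Read a JSON file as bytes and decode it leniently as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising
    # UnicodeDecodeError, so the rest of the file can still be parsed.
    return json.loads(raw.decode("utf-8", errors="replace"))
```

Replacing `json.loads(read_input(args.infile))` with a call like this would at least get past the decode step for file input. Note that the token filter in the script keeps only tokens whose first character is ASCII (`ord(t["text"][0]) <= 127`), so Chinese text would still be dropped from the karaoke line unless that condition is relaxed.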

playgithub avatar Jul 24 '24 08:07 playgithub