Korean Model Issues with Alignment and Transcription

Open Infinitay opened this issue 1 year ago • 0 comments

Terminal (Main) align_extend 2

E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.360]  안녕하세요. 저는 현우입니다. 한국어로 말해주세요.
[00:03.360 --> 00:09.120]  아무도 불가능한 길게 말하는 사람을 듣고 싶지 않습니다.
[00:09.120 --> 00:18.240]  그러나, 이런 길게 말하는 말은 한국어로 말하는 새로운 언어를 공부하는 경우에 정말 도움이 될 것입니다.
[00:18.240 --> 00:23.160]  영어로는 짧은 말과 길게 말하는 것과는 차이가 없죠.
[00:23.160 --> 00:24.160]  예를 들어,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:30.920 --> 01:34.580]  나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:34.580 --> 01:38.320]  내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:38.320 --> 01:41.000]  지금은 일단 해줄 수 있다고 볼 수 있는데
[01:41.000 --> 01:44.560]  100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:44.560 --> 01:45.600]  가능할까?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:49.600 --> 01:54.720]  나는 사자야, 너는 토끼야. 우리는 친구가 될 수 없어.
[01:54.720 --> 01:59.680]  내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:24.720 --> 02:32.720]  재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:32.720 --> 02:37.320]  마침 and 우연히 are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:43.920]  집에만 있을 생각이었지만.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:47.020 --> 02:49.020]  TALK TO ME IN KOREAN 에서 만나요!
[02:49.020 --> 02:49.520]  Bye!
Performing alignment...
[00:00.000 --> 00:00.502]  안녕하세요. 저는 현우입니다. 한국어로 말해주세요.
[00:01.360 --> 00:02.021]  아무도 불가능한 길게 말하는 사람을 듣고 싶지 않습니다.
[00:07.120 --> 00:08.282]  그러나, 이런 길게 말하는 말은 한국어로 말하는 새로운 언어를 공부하는 경우에 정말 도움이 될 것입니다.
[00:16.240 --> 00:16.821]  영어로는 짧은 말과 길게 말하는 것과는 차이가 없죠.
[00:21.160 --> 00:21.260]  예를 들어,
[00:24.160 --> 00:30.620]  I am a lion. You are a bunny. We can't be friends.
[00:30.620 --> 00:35.160]  You simply add more words like and, and so, and you say
[00:35.160 --> 00:41.260]  I'm a lion. And you're a bunny. So we can't be friends.
[00:41.260 --> 00:43.460]  The verbs themselves don't really change,
[00:43.460 --> 00:47.180]  so it's relatively easier to make longer sentences in English.
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.500 --> 01:15.060]  So, without understanding how the verbs change forms to be linked with the following part,
[01:15.060 --> 01:18.660]  you can't really make your sentences more fluid and flexible,
[01:18.660 --> 01:24.860]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.860 --> 01:28.440]  Again, you don't have to talk like this.
[01:28.920 --> 01:29.522]  나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:32.580 --> 01:33.322]  내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:36.320 --> 01:36.781]  지금은 일단 해줄 수 있다고 볼 수 있는데
[01:39.000 --> 01:39.501]  100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:42.560 --> 01:42.640]  가능할까?
[01:45.600 --> 01:49.600]  But you don't want to always talk like this either.
[01:47.600 --> 01:48.161]  나는 사자야, 너는 토끼야. 우리는 친구가 될 수 없어.
[01:52.720 --> 01:53.321]  내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:22.720 --> 02:23.581]  재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:30.720 --> 02:30.840]  마침 and 우연히 are connected together here.
[02:37.320 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  집에만 있을 생각이었지만.
[02:43.920 --> 02:47.020]  Then, I'll be waiting for you at TalkToMeInKorean.com.
[02:45.020 --> 02:45.160]  TALK TO ME IN KOREAN 에서 만나요!
[02:49.020 --> 02:49.520]  Bye!

E:\Applications\WhisperX>
Terminal (Not described below) No align_extend parameter

E:\Applications\WhisperX>whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx
C:\Users\REDACTED\AppData\Roaming\Python\Python39\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
  warnings.warn("No audio backend is available.")
[00:00.000 --> 00:03.320]  안녕하세요. 저는 현우에요. 한국어로 말해주세요.
[00:03.320 --> 00:09.060]  아무도 누군가에게 불가능한 길게 말하는 말을 듣고 싶지 않습니다.
[00:09.060 --> 00:18.280]  그러나, 이런 길게 말하는 말은 한국어 같은 새로운 언어를 공부할 때 정말 도움이 될 것입니다.
[00:18.280 --> 00:23.100]  영어에서는 짧은 말과 길게 말의 차이가 적습니다.
[00:23.100 --> 00:24.200]  예를 들어,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:52.480 --> 00:58.720]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:03.220 --> 01:08.440]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:30.840 --> 01:34.580]  나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:34.580 --> 01:38.280]  내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:38.280 --> 01:40.920]  지금은 일단 해줄 수 있다고 볼 수 있는데
[01:40.920 --> 01:44.520]  100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:44.520 --> 01:45.880]  가능할까?
[01:49.880 --> 01:54.920]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[01:54.920 --> 01:59.720]  내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:24.920 --> 02:32.920]  재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:32.920 --> 02:37.360]  마침 and 우연히 are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:41.120 --> 02:44.000]  집에만 있을 생각이었지만.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:47.000 --> 02:49.000]  TalkToMeInKorean에서 만나요!
[02:49.000 --> 02:49.500]  Bye!
Performing alignment...
[00:00.000 --> 00:00.482]  안녕하세요. 저는 현우에요. 한국어로 말해주세요.
[00:01.340 --> 00:02.062]  아무도 누군가에게 불가능한 길게 말하는 말을 듣고 싶지 않습니다.
[00:07.060 --> 00:08.102]  그러나, 이런 길게 말하는 말은 한국어 같은 새로운 언어를 공부할 때 정말 도움이 될 것입니다.
[00:16.280 --> 00:16.801]  영어에서는 짧은 말과 길게 말의 차이가 적습니다.
[00:21.100 --> 00:21.220]  예를 들어,
[00:24.200 --> 00:30.540]  I am a lion. You are a bunny. We can't be friends.
[00:30.540 --> 00:35.200]  You simply add more words like and, so, and you say
[00:35.200 --> 00:41.300]  I'm a lion. And you're a bunny. So, we can't be friends.
[00:41.300 --> 00:43.420]  The verbs themselves don't really change,
[00:43.420 --> 00:47.220]  so it's relatively easier to make longer sentences in English.
[00:47.220 --> 00:52.480]  But in Korean, with these three short sentences
[00:50.480 --> 00:51.061]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.720 --> 01:03.220]  You have to change the verb endings to form a longer sentence using them.
[01:01.220 --> 01:01.821]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
[01:08.440 --> 01:15.080]  So without understanding how the verbs change forms to be linked with the following part,
[01:15.080 --> 01:18.580]  You can't really make your sentences more fluid and flexible
[01:18.580 --> 01:24.920]  and it'll be harder for you to understand native speakers when they mix and link various sentence parts.
[01:24.920 --> 01:28.340]  Again, you don't have to talk like this.
[01:28.840 --> 01:29.442]  나는 사자고 너는 토끼니까 우리는 친구가 될 수 없지만
[01:32.580 --> 01:33.322]  내가 배가 안 고플 때는 너를 잡아먹지 않으려고 노력하겠다는 약속은
[01:36.280 --> 01:36.761]  지금은 일단 해줄 수 있다고 볼 수 있는데
[01:38.920 --> 01:39.441]  100% 보장할 수는 없다는 점을 이해해줬으면 좋겠는데
[01:42.520 --> 01:42.600]  가능할까?
[01:47.880 --> 01:48.481]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[01:52.920 --> 01:53.501]  내가 배가 안 고파. 그러면 너를 안 잡아먹어. 노력할게.
[02:22.920 --> 02:23.721]  재미있는 책, 한국어 공부를 좀 하려고 어디로 가면 좋을지 아직 모르겠어요.
[02:30.920 --> 02:31.040]  마침 and 우연히 are connected together here.
[02:37.360 --> 02:41.120]  This one is so, this is but.
[02:39.120 --> 02:39.381]  집에만 있을 생각이었지만.
[02:44.000 --> 02:47.000]  Then I'll be waiting for you at TalkToMeInKorean.com
[02:45.000 --> 02:45.120]  TalkToMeInKorean에서 만나요!
[02:49.000 --> 02:49.500]  Bye!

E:\Applications\WhisperX>

Environment

OS: Windows 10
Python: 3.9.9
WhisperX: https://github.com/m-bain/whisperX/commit/e909f2f766b23b2000f2d95df41f9b844ac53e49
Whisper Model: Large
Alignment Model: w11wo/wav2vec2-xls-r-300m-korean

Input

https://www.youtube.com/watch?v=GM2Ki_FmF5U (720p version, audio + video, mp4, I let WhisperX preprocess it)

WhisperX Command

whisperx --model large --language ko GM2Ki_FmF5U.mp4 --align_model wav2vec2-xls-r-300m-korean --output_dir examples/whisperx --align_extend 2

Issues

Note: Everything below is described using align_extend 2 as shown in the command above and the Terminal (Main) details above

Translating English to Korean when I don't want it to

In the input, the speaker uses a mix of English and Korean. He speaks English in the video's introduction and switches to Korean later in the video, such as during the example sentences. Instead of returning the introduction in English, WhisperX translated the English sentences into Korean, as you can see in the first lines of the output above (00:00 to 00:24).

This behavior is also inconsistent. For example, at 0:00:24 English is spoken and WhisperX transcribes it in English, which is fine. However, as mentioned above, the English spoken during the introduction was output in Korean. I have no clue why that is.
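To make it easier to spot where this happens, here is a minimal sketch that labels each transcribed segment by the script it was written in. It is not part of WhisperX; `script_of` is a hypothetical helper, and counting Hangul syllables (U+AC00 to U+D7A3) against ASCII letters is only a rough heuristic. It reports the script of the output text only, so deciding which language was actually spoken at that timestamp still has to be done by ear.

```python
def script_of(text: str) -> str:
    """Label a segment by the dominant script of its letters (rough heuristic)."""
    # Precomposed Hangul syllables occupy U+AC00..U+D7A3.
    hangul = sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")
    # ASCII letters stand in for "English" here; this is an assumption.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if hangul == 0 and latin == 0:
        return "none"
    if hangul > latin:
        return "hangul"
    if latin > hangul:
        return "latin"
    return "mixed"


# A few segments copied from the pre-alignment output above.
segments = [
    ("00:00.000", "안녕하세요. 저는 현우입니다. 한국어로 말해주세요."),
    ("00:24.160", "I am a lion. You are a bunny. We can't be friends."),
    ("00:52.480", "나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어."),
    ("02:32.720", "마침 and 우연히 are connected together here."),
]

labels = [(start, script_of(text)) for start, text in segments]
for start, label in labels:
    print(start, label)
```

Segments labelled `hangul` at timestamps where the speaker is talking in English (such as the introduction) are the ones affected by this issue.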

~~Alignment Issues~~

EDIT: SOLVED. I remembered that in #7 there was mention of having to use Chinese instead of cn, so I took another look at that issue and saw it was regarding alignment. I changed kr to Korean for the language parameter and this issue was resolved.

At first I tried passing in no align_extend parameter. That made the transcribed captions even worse. I then used align_extend 2, as given in the Japanese example in the README, which improved my results. It made everything better, and the English portions are lined up. However, the first occurrence (0:00:52) of the line below is not aligned properly at all:

나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.

The transcription is correct. The issue is the alignment. As you can see from the output of the command line, everything is correct before it performs the alignment:


[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.

However, after the alignment, it seems like all of the Korean sentences' alignments are less than a second long. Here is just one example, but if you look at the output above, you will notice it's the case for all Korean sentences post-alignment:


[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
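
To put numbers on this, here is a small sketch (not part of WhisperX; `parse_segments` and `compare` are hypothetical helpers) that parses the four lines quoted above before and after alignment, then reports the shift in start time and the duration for each segment. It assumes the `[MM:SS.mmm --> MM:SS.mmm]` format shown in the terminal output and that both runs list the same segments in the same order.

```python
import re
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches lines such as "[00:52.480 --> 00:58.680]  some text".
LINE_RE = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s+(.*)")


def parse_segments(output: str) -> List[Segment]:
    """Extract (start, end, text) from timestamped output lines."""
    segments = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            start = int(m.group(1)) * 60 + float(m.group(2))
            end = int(m.group(3)) * 60 + float(m.group(4))
            segments.append(Segment(start, end, m.group(5)))
    return segments


def compare(before: List[Segment], after: List[Segment]) -> List[Tuple[float, float, float, str]]:
    """Return (start shift, duration before, duration after, text) per segment."""
    return [
        (round(a.start - b.start, 3), round(b.end - b.start, 3), round(a.end - a.start, 3), a.text)
        for b, a in zip(before, after)
    ]


BEFORE = """
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:52.480 --> 00:58.680]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:03.060 --> 01:08.500]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
"""

AFTER = """
[00:47.180 --> 00:52.480]  But in Korean, with these three short sentences,
[00:50.480 --> 00:51.061]  나는 사자야. 너는 토끼야. 우리는 친구가 될 수 없어.
[00:58.680 --> 01:03.060]  You have to change the verb endings to form a longer sentence using them.
[01:01.060 --> 01:01.681]  나는 사자고, 너는 토끼니까, 우리는 친구가 될 수 없어.
"""

rows = compare(parse_segments(BEFORE), parse_segments(AFTER))
for shift, dur_before, dur_after, text in rows:
    flag = "SHORT" if dur_after < 1.0 else "ok"
    print(f"{flag:5} shift={shift:+.3f}s  {dur_before:.3f}s -> {dur_after:.3f}s  {text[:20]}")
```

On this excerpt the two English segments are unchanged, while both Korean segments start 2 seconds earlier and shrink from roughly 6 seconds to about 0.6 seconds. The 2 second shift matches the align_extend value numerically, but whether that is the cause is not established here.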

Infinitay avatar Dec 22 '22 22:12 Infinitay