webvtt-py

Better parsing of srt subtitles to remove double newlines/breaks

Open shubhank008 opened this issue 5 years ago • 4 comments

I am getting a Malformed exception on some of my SRT files because they contain stray double line breaks, which I think breaks your parser.
I tried fixing it by replacing runs of 2 or 3 line breaks with a single line break, but that wasn't as accurate as a regex or a proper approach would be. I would appreciate it if you could add this.

Example subtitle (part of it)

00:01:10.733 --> 00:01:12.272
Aren't you excited?

00:01:14.143 --> 00:01:17.942
Let's find another place 

to hide out this year,

and play video 
games until it blows over.

00:01:17.943 --> 00:01:19.942

That'll get us through half a day, no problem.

shubhank008 avatar Jun 02 '20 12:06 shubhank008

Another example

10
00:02:05,988 --> 00:02:10,987
CHAPITRE 12

BAPTÊME ET PARADIS DES DIEUX

11
00:02:13,278 --> 00:02:14,367
Je vois…

12
00:02:14,488 --> 00:02:17,747
Tu vas arrêter de travailler
pour M. Benno ?

13
00:02:19,368 --> 00:02:21,497
Oui. J’en ai parlé à Otto.

shubhank008 avatar Jun 03 '20 08:06 shubhank008

I'm also having this issue right now, torn between writing my own converter and pre-patching the SRT file to get rid of these line breaks.

arqtiq avatar Jun 08 '20 18:06 arqtiq

I'm also having this issue right now, torn between writing my own converter and pre-patching the SRT file to get rid of these line breaks.

I ended up writing a pre-patch to sanitize my SRT files before reading them with webvtt. It uses a mix of replace and regex to remove the extra line breaks, and I keep expanding the regex whenever I run into another formatting mess.

shubhank008 avatar Jun 09 '20 07:06 shubhank008

Hi @shubhank008, could you share the replace / regex that you used? I'm running into the same issues!

kicks66 avatar Apr 17 '24 09:04 kicks66
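
The pre-patch described above was never posted in this thread, so the following is only a minimal sketch of one way such a sanitizer could work. It is not shubhank008's code and not part of webvtt-py. The names `sanitize_subtitles` and `TIMING` are hypothetical. The idea: drop every blank line, then re-insert a single blank line only in front of a line that starts a new cue (a timing line, or a numeric index immediately followed by a timing line).

```python
import re

# Hypothetical helper pattern: matches SRT ("," millis) or VTT ("." millis) timing lines.
TIMING = re.compile(
    r"^\s*\d{1,2}:\d{2}:\d{2}[.,]\d{3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[.,]\d{3}"
)


def sanitize_subtitles(text):
    """Remove blank lines inside cues, keeping one blank line between cues.

    Assumption: a line made only of digits that is directly followed by a
    timing line is a cue index. Caption text that happens to look like that
    will be misread as an index.
    """
    # Strip a leading BOM and normalize line endings.
    text = text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    # Keep only non-blank lines, trimming trailing whitespace.
    kept = [line.rstrip() for line in text.split("\n") if line.strip()]

    out = []
    for i, line in enumerate(kept):
        if TIMING.match(line):
            # A timing line starts a cue unless its index line already did.
            starts_cue = not (i > 0 and kept[i - 1].strip().isdigit())
        else:
            # A numeric index starts a cue if a timing line follows it.
            starts_cue = (
                line.strip().isdigit()
                and i + 1 < len(kept)
                and bool(TIMING.match(kept[i + 1]))
            )
        if starts_cue and out:
            out.append("")  # single blank line between cues
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "10\n00:02:05,988 --> 00:02:10,987\nCHAPITRE 12\n\n"
        "BAPTÊME ET PARADIS DES DIEUX\n\n"
        "11\n00:02:13,278 --> 00:02:14,367\nJe vois…\n"
    )
    print(sanitize_subtitles(sample))
```

To use it, one could read the broken file, write the cleaned text to a new `.srt` file, and pass that path to `webvtt.from_srt(...)`. Because the sketch relies on the heuristic noted in the docstring, it is worth spot-checking the output on a few files before trusting it in bulk.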