pysrt Captions whose text begins with Line Separator character are parsed as blank string

Open ontl opened this issue 3 years ago • 1 comments

I occasionally see SRT files in which one or two captions begin with the Line Separator character, u2028 (U+2028). Those captions are incorrectly parsed as blank.

I believe the character originates in Word and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.

This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
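
One plausible explanation for this (an assumption, not something traced through the pysrt source) is that Python itself treats U+2028 as both whitespace and a line boundary. A short sketch of that built-in string behaviour:

```python
LINE_SEPARATOR = "\u2028"

# Python counts U+2028 as whitespace, so a line holding only that
# character is empty after strip().
print(repr(LINE_SEPARATOR.strip()))

# str.splitlines() also treats U+2028 as a line boundary, so a leading
# separator yields an empty first line.
leading = (LINE_SEPARATOR + "First caption line").splitlines()
middle = ("before" + LINE_SEPARATOR + "after").splitlines()
print(leading)
print(middle)
```

If a parser splits caption text this way and treats an empty line as the end of a caption, a leading U+2028 would produce the blank caption described here, while one in the middle would simply act as a line break.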

I think the parser ought to ignore this character.

VLC, for the record, ignores it and displays the caption normally.

Gotchas: It may make sense to pre-process the file, replacing u2028 with a more compatible line break like \n. We should be careful, though, not to inadvertently trigger the blank line state outlined in Issue 71 by having a caption start with \n.
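
A minimal sketch of such a pre-processing step, assuming the SRT content is already loaded as a string. The helper name `normalize_line_separators` is hypothetical and not part of pysrt. It drops U+2028 next to an existing line break, so no caption ends up starting with \n, and converts the rest to \n:

```python
import re

def normalize_line_separators(srt_text: str) -> str:
    """Hypothetical helper: make U+2028 harmless before parsing SRT text."""
    # Drop U+2028 at the start of a line (or of the file), so a caption
    # never begins with a line break (the Issue 71 case).
    text = re.sub(r"(?m)^\u2028+", "", srt_text)
    # Drop U+2028 right before an existing line break or the end of the
    # text, so no extra blank line is introduced there.
    text = re.sub(r"\u2028+(?=\r?\n|\Z)", "", text)
    # Whatever is left sits mid-caption: turn each run into one "\n".
    return re.sub(r"\u2028+", "\n", text)

sample = (
    "1\n00:00:08,330 --> 00:00:13,653\n\u2028Starts with the separator\n\n"
    "2\n00:00:13,653 --> 00:00:18,305\nHas one here:\u2028mid caption\n"
)
cleaned = normalize_line_separators(sample)
print(cleaned)
```

Collapsing runs of U+2028 into a single \n is a design choice made for this sketch: two consecutive separators would otherwise become a blank line and split the caption in two.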

Example SRT that exhibits this problem:

1
00:00:08,330 --> 00:00:13,653
 This caption starts with the character
u2028, which causes PySRT to see it as blank.

2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:  which does not cause issues.

3
00:00:18,305 --> 00:00:22,906

This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.

Output:

Caption 1: VLC displays the caption, PySRT parses it as blank
Caption 2: VLC and PySRT display the caption
Caption 3: VLC and PySRT show the caption as blank

Jun 17 '21 02:06 ontl

After some poking around, I've had success preprocessing my srt files with .replace('\n\u2028', '\n')
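
For reference, a small sketch of that workaround applied to the first caption from the example. The function name is made up for illustration:

```python
def strip_leading_line_separator(srt_text: str) -> str:
    # Remove a U+2028 that directly follows a newline, i.e. one that
    # opens a caption line. Separators elsewhere are left untouched.
    return srt_text.replace("\n\u2028", "\n")

raw = (
    "1\n00:00:08,330 --> 00:00:13,653\n"
    "\u2028This caption starts with the character\n"
)
cleaned = strip_leading_line_separator(raw)
print(cleaned.splitlines()[2])
```

The cleaned text can then presumably be handed to pysrt.from_string rather than opening the file directly. Note that this does not catch a U+2028 at the very start of the file or several in a row.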

Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.

Jun 19 '21 02:06 ontl
