pysrt
Captions whose text begins with the Line Separator character are parsed as a blank string
I occasionally see SRTs in which 1 or 2 captions begin with the Line Separator character, U+2028. Those captions get incorrectly parsed as blank.
I believe the character originates in Word, and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.
This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
I think the parser ought to ignore this character.
VLC, for the record, ignores it and displays the caption normally.
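One plausible explanation (an assumption, not confirmed against the pysrt source) is that line-based parsing in Python treats U+2028 as a line boundary. `str.splitlines()` does, so a caption that starts with the character yields an empty first line, while the same character in the middle of a caption simply splits it in two. A minimal sketch of that behaviour, using made-up caption strings:

```python
# Illustrative strings only; they mirror captions 1 and 2 of the example SRT.
leading = "\u2028This caption starts with U+2028"
middle = "This caption has a U+2028 here:\u2028which does not cause issues."

# str.splitlines() counts U+2028 (Line Separator) as a line boundary.
leading_lines = leading.splitlines()
middle_lines = middle.splitlines()

print(leading_lines)  # ['', 'This caption starts with U+2028']
print(middle_lines)   # ['This caption has a U+2028 here:', 'which does not cause issues.']
```

If the parser reads caption text up to the first empty line, the empty first element above would be enough to end the caption before any text is collected.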
Gotchas:
It may make sense to pre-process the file, replacing U+2028 with a more compatible line break like \n. We should be careful, though, not to inadvertently trigger the blank line state outlined in Issue 71 by having a caption start with \n.
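A rough sketch of what such pre-processing could look like, under the assumption that it runs on the raw SRT text before parsing. `normalize_line_separators` is a hypothetical helper, not part of pysrt. It drops U+2028 at the start or end of a line, so no caption begins with a break and no blank line is created, and turns any remaining U+2028 into \n:

```python
import re


def normalize_line_separators(srt_text: str) -> str:
    """Hypothetical helper: make U+2028 safe for a line-based SRT parser."""
    # Drop U+2028 runs at the start of a line, so no caption starts with a break.
    text = re.sub(r"(^|\n)\u2028+", r"\1", srt_text)
    # Drop U+2028 runs at the end of a line, so no extra blank line appears.
    text = re.sub(r"\u2028+(?=\r?\n|\Z)", "", text)
    # Whatever is left sits mid-line: turn each run into a single line break.
    return re.sub(r"\u2028+", "\n", text)


print(repr(normalize_line_separators("1\n00:00:08,330 --> 00:00:13,653\n\u2028Hello")))
```

Collapsing runs of U+2028 into one \n is a design choice made here to avoid producing an empty line in the middle of a caption, which would split it.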
Example SRT that exhibits this problem:
1
00:00:08,330 --> 00:00:13,653
This caption starts with the character
u2028, which causes PySRT to see it as blank.
2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:
which does not cause issues.
3
00:00:18,305 --> 00:00:22,906
This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.
Output:
- Caption 1: VLC displays the caption, PySRT parses it as blank
- Caption 2: VLC and PySRT display the caption
- Caption 3: VLC and PySRT show the caption as blank
After some poking around, I've had success preprocessing my srt files with .replace('\n\u2028', '\n')
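For reference, a minimal sketch of that workaround applied at file-read time. The function name and the UTF-8 default are assumptions for illustration, and only the standard library is used:

```python
from pathlib import Path


def read_srt_without_leading_separator(path, encoding="utf-8"):
    """Hypothetical helper: read an SRT file and apply the replace() workaround."""
    # Text mode translates \r\n to \n on read, so the pattern below also
    # covers files saved with Windows line endings.
    text = Path(path).read_text(encoding=encoding)
    # Remove a U+2028 that directly follows a line break (i.e. starts a line).
    return text.replace("\n\u2028", "\n")
```

The returned string can then be handed to the parser, for example via `pysrt.from_string(...)`, instead of opening the file directly.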
Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.