
Captions whose text begins with the Line Separator character are parsed as a blank string

Open: ontl opened this issue 3 years ago • 1 comment

I occasionally see SRTs in which 1 or 2 captions begin with the Line Separator character, u2028. Those captions get incorrectly parsed as blank.

I believe the character originates in Word, and is carried over when a transcript is copy-pasted into YouTube to use YouTube's transcript auto-timing function.

This character seems to act as a normal line break when in the middle or end of a caption; the issue only arises when it is the first character of the caption.
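One possible explanation, offered as a guess rather than something confirmed from the pysrt source: Python's built-in `str.splitlines()` treats U+2028 as a line boundary, so a caption that starts with it yields an empty first line, much as a leading \n would. The strings below are made up for illustration.

```python
# Illustrative strings only; they are not taken from a real SRT file.
leading = "\u2028This caption starts with U+2028"
middle = "This caption has a separator here:\u2028second part"

# str.splitlines() splits on U+2028 as well as on \n and \r\n.
leading_parts = leading.splitlines()
middle_parts = middle.splitlines()

print(leading_parts)  # first element is the empty string
print(middle_parts)   # two non-empty elements
```

If the parser splits caption text along these lines, an empty first line would look the same as the blank-line case, which would match the behaviour described here.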

I think the parser should ignore this character.

VLC, for the record, ignores it and displays the caption normally.

Gotchas: It may make sense to pre-process the file, replacing u2028 with a more compatible line break like \n. We should be careful, though, not to inadvertently trigger the blank line state outlined in Issue 71 by having a caption start with \n.
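A minimal sketch of that pre-processing idea, assuming the SRT content is already loaded as a string with \n line endings. The function name `normalize_line_separators` is hypothetical and not part of pysrt. It drops any U+2028 at the start or end of a line so that no caption ends up starting with \n, and converts the remaining ones to \n.

```python
import re

LINE_SEPARATOR = "\u2028"

# Hypothetical helper, not part of pysrt.
def normalize_line_separators(srt_text: str) -> str:
    # 1. Drop U+2028 at the start of a line. Replacing it with \n here
    #    would create the leading blank line this step is meant to avoid.
    text = re.sub("^" + LINE_SEPARATOR + "+", "", srt_text, flags=re.MULTILINE)
    # 2. Drop U+2028 at the end of a line; the real line break follows anyway.
    text = re.sub(LINE_SEPARATOR + "+$", "", text, flags=re.MULTILINE)
    # 3. Whatever is left sits mid-caption: turn each run into a single \n.
    return re.sub(LINE_SEPARATOR + "+", "\n", text)


sample = "1\n00:00:01,000 --> 00:00:02,000\n\u2028Hello\u2028world\u2028\n\n"
cleaned = normalize_line_separators(sample)
print(repr(cleaned))
```

A line that consists only of U+2028 still becomes an empty line under this sketch, so it does not cover every edge case.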

Example SRT that exhibits this problem:

1
00:00:08,330 --> 00:00:13,653

This caption starts with the character
u2028, which causes PySRT to see it as blank.

2
00:00:13,653 --> 00:00:18,305
This caption has a u2028 here:
 which does not cause issues.

3
00:00:18,305 --> 00:00:22,906

This caption starts with a normal line break; VLC
and PySRT show it as blank as per Issue 71.

Output:

  • Caption 1: VLC displays the caption, PySRT parses it as blank
  • Caption 2: VLC and PySRT display the caption
  • Caption 3: VLC and PySRT show the caption as blank

ontl, Jun 17 '21 02:06

After some poking around, I've had success preprocessing my srt files with .replace('\n\u2028', '\n')
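For anyone wanting to try the same workaround, here is a rough sketch of how the replace could be wrapped around file reading. The helper name `read_srt_without_leading_separator` and the UTF-8 default are assumptions made for illustration, not pysrt API.

```python
from pathlib import Path

# Hypothetical helper, not part of pysrt.
def read_srt_without_leading_separator(path, encoding="utf-8"):
    # Text-mode reading translates \r\n to \n, so the pattern below
    # also matches in files saved with Windows line endings.
    text = Path(path).read_text(encoding=encoding)
    # Remove a U+2028 that directly follows a line break, i.e. one that
    # starts a line of caption text. Mid-line occurrences are left alone.
    return text.replace("\n\u2028", "\n")
```

The cleaned string can then be parsed from memory instead of from the original file; I believe `pysrt.from_string` accepts such a string, but check that against the pysrt version in use.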

Will look through the pysrt code and submit a PR if I can find the best place/method to do this. Suggestions welcome.

ontl, Jun 19 '21 02:06