pysub-parser
Library for extracting text and timestamps from multiple subtitle files (.ass, .ssa, .srt, .sub, .txt).
Utility to extract the contents of a subtitle file.
Supported types:
- ass: Advanced SubStation Alpha
- ssa: SubStation Alpha
- srt: SubRip
- sub: MicroDVD
- txt: Sub Viewer
For more information: http://write.flossmanuals.net/video-subtitling/file-formats
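Because the subtype defaults to the file extension, the list above effectively doubles as a lookup table. The following is a minimal sketch of that idea only; `SUPPORTED_TYPES` and `guess_subtype` are hypothetical names made up for illustration and are not part of pysub-parser.

```python
import os

# Hypothetical lookup table built from the list of supported types above;
# this is not the library's own code.
SUPPORTED_TYPES = {
    'ass': 'Advanced SubStation Alpha',
    'ssa': 'SubStation Alpha',
    'srt': 'SubRip',
    'sub': 'MicroDVD',
    'txt': 'Sub Viewer',
}


def guess_subtype(path):
    """Return the lower-cased file extension if it is a supported type."""
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    if extension not in SUPPORTED_TYPES:
        raise ValueError(f'unsupported subtitle type: {extension!r}')
    return extension


print(guess_subtype('./files/space-jam.srt'))  # prints: srt
```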
Usage
The parse method takes the following parameters:
- path: location of the subtitle file.
- subtype: one of the supported file types; by default, the file extension is used.
- encoding: encoding of the file, utf-8 by default.
- **kwargs: optional parameters.
  - fps: frame rate (only used by sub files), 23.976 by default.
from pysubparser import parser

subtitles = parser.parse('./files/space-jam.srt')

for subtitle in subtitles:
    print(subtitle)
Output:
0 > [BALL BOUNCING]
1 > Michael?
2 > What are you doing out here, son? It's after midnight.
3 > MICHAEL: Couldn't sleep, Pops.
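The fps parameter matters for sub files because MicroDVD stores frame numbers rather than clock times, in lines of the form `{start_frame}{end_frame}text`. A frame number becomes a timestamp by dividing it by the frame rate. The sketch below illustrates that conversion on its own; `frame_to_timedelta` and `parse_microdvd_line` are hypothetical helpers, not functions exposed by pysub-parser, and the library's internal handling may differ.

```python
import re
from datetime import timedelta

DEFAULT_FPS = 23.976  # the default frame rate stated in the parameter list

# A MicroDVD line: "{start_frame}{end_frame}text"
MICRODVD_LINE = re.compile(r'^\{(\d+)\}\{(\d+)\}(.*)$')


def frame_to_timedelta(frame, fps=DEFAULT_FPS):
    """Convert a frame number to elapsed time: seconds = frame / fps."""
    return timedelta(seconds=frame / fps)


def parse_microdvd_line(line, fps=DEFAULT_FPS):
    """Hypothetical helper: split one MicroDVD line into (start, end, text)."""
    match = MICRODVD_LINE.match(line.strip())
    if match is None:
        raise ValueError(f'not a MicroDVD line: {line!r}')
    start_frame, end_frame, text = match.groups()
    return (
        frame_to_timedelta(int(start_frame), fps),
        frame_to_timedelta(int(end_frame), fps),
        text,
    )


start, end, text = parse_microdvd_line('{250}{500}Michael?', fps=25)
print(start, end, text)  # prints: 0:00:10 0:00:20 Michael?
```

The same frame numbers map to different times at a different frame rate, so passing the wrong fps shifts every subtitle in the file.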
Subtitle Class
Each line of dialogue is represented by a Subtitle object with the following properties:
- index: position in the file.
- start: timestamp of the start of the dialogue.
- end: timestamp of the end of the dialogue.
- text: dialogue contents.
for subtitle in subtitles:
    print(f'{subtitle.start} > {subtitle.end}')
    print(subtitle.text)
    print()
Output:
00:00:36.328000 > 00:00:38.329000
[BALL BOUNCING]
00:01:03.814000 > 00:01:05.189000
Michael?
00:01:08.402000 > 00:01:11.404000
What are you doing out here, son? It's after midnight.
00:01:11.572000 > 00:01:13.072000
MICHAEL: Couldn't sleep, Pops.
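The timestamps printed above have the same shape as Python's `datetime.time` values, so one plausible reading is that start and end are `time` objects. That is an assumption, not something this README states. Under that assumption, the stand-in class below mirrors the four documented properties and shows how the length of a line could be derived; `SubtitleSketch` and its `duration` property are hypothetical and not part of the library.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta


@dataclass
class SubtitleSketch:
    """Hypothetical stand-in with the four properties listed above."""
    index: int
    start: time
    end: time
    text: str

    def __str__(self):
        # Mirrors the "index > text" shape of the printed output shown earlier.
        return f'{self.index} > {self.text}'

    @property
    def duration(self) -> timedelta:
        # time objects cannot be subtracted directly,
        # so anchor both to the same arbitrary date first.
        anchor = date.min
        return datetime.combine(anchor, self.end) - datetime.combine(anchor, self.start)


line = SubtitleSketch(0, time(0, 0, 36, 328000), time(0, 0, 38, 329000), '[BALL BOUNCING]')
print(line)                           # prints: 0 > [BALL BOUNCING]
print(f'{line.start} > {line.end}')   # prints: 00:00:36.328000 > 00:00:38.329000
print(line.duration)                  # prints: 0:00:02.001000
```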
Cleaners
Currently, 4 cleaners are provided:
- ascii: translates every Unicode character to its ASCII equivalent.
- brackets: removes brackets and anything between them (e.g., [BALL BOUNCING]).
- formatting: removes formatting tags like <i> and </i>.
- lower_case: converts all text to lower case.
from pysubparser.cleaners import ascii, brackets, formatting, lower_case

subtitles = brackets.clean(
    lower_case.clean(
        subtitles
    )
)

for subtitle in subtitles:
    print(subtitle)
Output:
0 >
1 > michael?
2 > what are you doing out here, son? it's after midnight.
3 > michael: couldn't sleep, pops.
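To make the effect of each cleaner concrete, here is a rough, self-contained approximation of the four behaviours applied to plain strings. These functions are illustrative stand-ins written for this example, not the library's implementations. In particular, the ASCII step uses Unicode NFKD decomposition and drops whatever cannot be encoded, which may differ from what the real ascii cleaner does.

```python
import re
import unicodedata


def clean_brackets(text):
    """Remove square brackets and anything between them."""
    return re.sub(r'\[[^\]]*\]', '', text).strip()


def clean_formatting(text):
    """Remove simple formatting tags such as <i> and </i>."""
    return re.sub(r'</?[a-zA-Z][^>]*>', '', text)


def clean_lower_case(text):
    """Lower-case all text."""
    return text.lower()


def clean_ascii(text):
    """Approximate ASCII folding: decompose accents, drop what remains non-ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


def apply_cleaners(texts, *cleaners):
    """Lazily apply each cleaner, in order, to every text."""
    for text in texts:
        for cleaner in cleaners:
            text = cleaner(text)
        yield text


lines = ['[BALL BOUNCING]', '<i>Michael?</i>', "MICHAEL: Couldn't sleep, Pops."]
cleaned = list(apply_cleaners(lines, clean_lower_case, clean_brackets, clean_formatting))
print(cleaned)  # prints: ['', 'michael?', "michael: couldn't sleep, pops."]
```

As in the example above, a line that consists only of bracketed text ends up empty rather than being dropped.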
Writers
Given any list of Subtitle objects and a path, the writer outputs those subtitles in srt format.
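For reference, an srt file is a sequence of blocks, each made of a counter starting at 1, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, and a blank line. The sketch below produces that layout from plain tuples. It does not use the library's writer, whose function names are not shown in this README; `format_srt_timestamp` and `to_srt` are hypothetical helpers.

```python
from datetime import time


def format_srt_timestamp(value):
    """Render a datetime.time as HH:MM:SS,mmm (SubRip uses a comma before milliseconds)."""
    milliseconds = value.microsecond // 1000
    return f'{value:%H:%M:%S},{milliseconds:03d}'


def to_srt(entries):
    """Hypothetical helper: build srt text from (start, end, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(entries, start=1):
        time_range = f'{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}'
        blocks.append(f'{number}\n{time_range}\n{text}\n')
    # A blank line separates consecutive blocks.
    return '\n'.join(blocks)


entries = [
    (time(0, 0, 36, 328000), time(0, 0, 38, 329000), '[BALL BOUNCING]'),
    (time(0, 1, 3, 814000), time(0, 1, 5, 189000), 'Michael?'),
]
print(to_srt(entries))
```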