Allow parsing of improperly formatted SubRip files?
At the moment the library only supports parsing strictly valid SubRip files. To build tools that realign cues, one might need to handle improperly formatted files as well.
The following faults come to my mind:
- Overlapping cues
- Cues with timecodes like this one: 00:28:41,1000
- Using dots for separating milliseconds
Are these goals in line with yours (and would you therefore accept PRs), or would you rather keep it a strict parser?
I wouldn't mind implementing a way to allow improperly formatted files.
The initial reason I made this lib was to easily manipulate subtitles, so it would seem logical to handle those files too.
Maybe it would be better to implement this feature by adding configurable modes to your code? Clients could use a strict mode if they need to strictly validate files, and some kind of soft mode that allows parsing files with mistakes.
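To make the strict/soft idea concrete, here is a minimal sketch in Python. It is not this library's API: `ParseMode`, `SubtitleFormatError`, `report_fault` and `check_indexes` are hypothetical names used only for illustration. The idea is that every fault goes through one function, which raises in strict mode and records a warning in soft mode.

```python
from enum import Enum


class ParseMode(Enum):
    """Hypothetical mode switch; not part of any existing library."""
    STRICT = "strict"
    LOOSE = "loose"


class SubtitleFormatError(ValueError):
    """Raised in strict mode when a file breaks a format rule."""


def report_fault(message, mode, warnings):
    # Strict mode stops at the first fault; loose mode records it and goes on.
    if mode is ParseMode.STRICT:
        raise SubtitleFormatError(message)
    warnings.append(message)


def check_indexes(indexes, mode=ParseMode.STRICT):
    """Check that cue indexes count up from 1 and return collected warnings."""
    warnings = []
    for position, index in enumerate(indexes, start=1):
        if index != position:
            report_fault(
                f"cue at position {position} has index {index}", mode, warnings
            )
    return warnings
```

Calling `check_indexes([0, 1, 2], ParseMode.LOOSE)` returns a list of warnings instead of failing, while the same call in strict mode raises `SubtitleFormatError`. Other checks (timecodes, overlaps) could reuse `report_fault` in the same way.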
Other elements I encountered that the parser failed on:
- Cues starting with index 0 instead of index 1
- Cues with timecodes like 00:00:07
I have to bump this issue too. I would like to have an option (maybe set as the default) to accept improperly formatted SRT files. Right now the library is not really usable if it cannot handle them: subtitle files come from many sources, and one has no control over that. The most problematic formatting issues are as follows (others are fixed quite easily, but need fixing too):
- Overlapping cues
- Cues starting with index 0 instead of 1
Please fix this if possible. Thank you.
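As a rough illustration of how these two faults could be handled once the cues are in memory, here is a small Python sketch. The `Cue` class and the `renumber` and `find_overlaps` helpers are assumptions made for this example, not the library's actual types. Times are held in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """Hypothetical in-memory cue; times are in milliseconds."""
    index: int
    start_ms: int
    end_ms: int
    text: str


def _by_time(cues):
    # Sort by start time, then end time, without touching the input list.
    return sorted(cues, key=lambda cue: (cue.start_ms, cue.end_ms))


def renumber(cues):
    """Return copies of the cues in time order, numbered from 1."""
    return [
        Cue(new_index, cue.start_ms, cue.end_ms, cue.text)
        for new_index, cue in enumerate(_by_time(cues), start=1)
    ]


def find_overlaps(cues):
    """Return (earlier, later) index pairs where a cue starts before the
    previous cue in time order has ended. Only neighbours are compared."""
    ordered = _by_time(cues)
    return [
        (previous.index, current.index)
        for previous, current in zip(ordered, ordered[1:])
        if current.start_ms < previous.end_ms
    ]
```

A lenient parser could call `find_overlaps` to report problems without rejecting the file, and `renumber` to fix files whose first cue has index 0. A cue that starts exactly when the previous one ends is not counted as overlapping here.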
Okay guys, I got it, I'll see what I can do.
There's already a PR to handle overlapping cues.
I needed this to convert srt to webvtt and did some work that fits the needs of a specific project I'm working on (I'm not sure if it's pull-request-worthy), see here: https://github.com/thomaspeeters/captioning/commit/13202f46f474eefc99f91ee02f5f1ab6946cb32b
Things I allow in my "loose validation mode":
- Milliseconds aren't required for timecodes
- Trailing whitespace at the end of the file is allowed
- Equal begin and end timecodes are allowed by default in loose validation mode
- Overlapping timecodes are allowed
- Timecodes without leading zeroes in any of the time units are allowed
Things I encountered, but didn't allow in my implementation:
- Leading blank line in the subtitle text
- Empty subtitle text
- Missing subtitle index/order number
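The timecode rules from the loose mode above can be sketched roughly like this in Python. This is not the code from the linked commit; `parse_loose_timecode` and `duration_ok` are hypothetical names, and two choices are my own assumptions: a dot is accepted as the millisecond separator (mentioned earlier in this thread), and milliseconds, when present, must be exactly three digits so that a value like 00:28:41,1000 is still rejected.

```python
import re

# Hours/minutes/seconds may lack leading zeroes; milliseconds are optional.
# Assumption: "," or "." separates milliseconds, and exactly 3 digits follow.
LOOSE_TIMECODE = re.compile(r"(\d+):(\d{1,2}):(\d{1,2})(?:[,.](\d{3}))?")


def parse_loose_timecode(text):
    """Convert a leniently formatted timecode to milliseconds."""
    match = LOOSE_TIMECODE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timecode: {text!r}")
    hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
    millis = int(match.group(4) or 0)  # missing milliseconds count as 0
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes or seconds out of range: {text!r}")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def duration_ok(start_ms, end_ms, loose=False):
    """Loose mode accepts equal start and end; strict mode needs end > start."""
    return end_ms >= start_ms if loose else end_ms > start_ms
```

With this sketch, `parse_loose_timecode("00:00:07")` and `parse_loose_timecode("0:1:2,500")` both succeed, while a four-digit millisecond field raises `ValueError`.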
@thomaspeeters I think it is a very good start. How about opening a pull request?
I made a pull request. It's the first time I've done this, so I hope I did it right...