
Allow parsing of improperly formatted Subrip files?

Open z38 opened this issue 10 years ago • 8 comments

At the moment the library only supports parsing strictly valid SubRip files. To build tools that realign cues, one might need to handle improperly formatted files as well.

The following faults come to mind:

  • Overlapping cues
  • Cues with timecodes like this one: 00:28:41,1000
  • Using dots for separating milliseconds
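
To make the two timecode faults above concrete, here is a minimal Python sketch of a lenient timecode parser. It is not taken from the library; the function name, the pattern, and the way an overlong fraction such as `,1000` is handled (treated as a millisecond count that carries into the seconds) are all assumptions made for illustration.

```python
import re

# Hypothetical lenient pattern: "," or "." may separate the fraction,
# and the fraction may have three or more digits.
_LENIENT_TIMECODE = re.compile(r"^(\d{1,2}):(\d{2}):(\d{2})[,.](\d{3,})$")


def parse_timecode_lenient(text: str) -> int:
    """Return the timecode as a total number of milliseconds."""
    match = _LENIENT_TIMECODE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timecode: {text!r}")
    hours, minutes, seconds, fraction = (int(group) for group in match.groups())
    # Assumption: the fraction is a millisecond count, so ",1000" simply
    # adds one full second instead of being rejected.
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + fraction
```

Under that assumption, `00:28:41,1000` and `00:28:42,000` come out as the same instant. A tool that would rather flag such values could reject any fraction above 999 instead.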

Are these goals in line with yours (meaning you would accept PRs), or would you rather keep it a strict parser?

z38 avatar Aug 18 '15 09:08 z38

I wouldn't mind implementing a way to allow improperly formatted files.

The initial reason I made this lib was to easily manipulate subtitles, so it would seem logical to handle those files too.

delphiki avatar Aug 23 '15 18:08 delphiki

Maybe it would be better to implement this feature by adding configurable modes to your code? Clients could use a strict mode if they need to strictly validate files, and some kind of soft mode that allows parsing files with mistakes.
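
As a rough illustration of the strict/soft idea, here is a small Python sketch of a validator with a mode switch. The `Cue` class, the `validate_cues` function, and the specific checks are hypothetical and are not the library's API; they only show how one flag could decide which faults are reported.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int


def validate_cues(cues: List[Cue], strict: bool = True) -> List[str]:
    """Return a list of problems; an empty list means the cues are accepted."""
    errors = []
    for position, cue in enumerate(cues):
        # Reported in both modes: a cue that ends before it starts.
        if cue.end_ms < cue.start_ms:
            errors.append(f"cue {cue.index}: ends before it starts")
        if not strict:
            continue
        # Strict mode only: indexes must count up from 1.
        if cue.index != position + 1:
            errors.append(f"cue {cue.index}: expected index {position + 1}")
        # Strict mode only: a cue may not start before the previous one ends.
        if position > 0 and cue.start_ms < cues[position - 1].end_ms:
            errors.append(f"cue {cue.index}: overlaps the previous cue")
    return errors
```

Returning the problems instead of raising on the first one lets a client decide what to do with a file that is only slightly off.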

vkhramtsov avatar Aug 24 '15 05:08 vkhramtsov

Other elements I encountered that the parser failed on:

  • Cues starting with index 0 instead of index 1
  • Cues with timecodes like 00:00:07

hartman avatar Sep 05 '15 20:09 hartman

I have to bump this issue too. I would like to have an option (maybe set as default) to accept "improperly formatted" SRT files. Right now the library is not really usable if it cannot handle them: subtitle files come from many sources, and one has no control over that. The most common formatting problems are as follows (others are fixed quite easily, but need fixing too):

  • Overlapping cues
  • Cues starting with index 0 instead of 1

Please fix this if possible. Thank you.

2ge avatar Sep 26 '15 05:09 2ge

Okay guys, I got it, I'll see what I can do.

There's already a PR to handle overlapping cues.

delphiki avatar Sep 26 '15 11:09 delphiki

I needed this for the purpose of converting srt to webvtt and did some work which fit the needs of a specific project I'm working on (I'm not sure if it's pull-request-worthy), see here: https://github.com/thomaspeeters/captioning/commit/13202f46f474eefc99f91ee02f5f1ab6946cb32b

Things I allow in my "loose validation mode":

  • Milliseconds aren't required for timecodes
  • Trailing whitespace at the end of the file is allowed
  • Equal begin and end timecodes are allowed by default in loose validation mode
  • Overlapping timecodes are allowed
  • Timecodes without leading zeroes in any of the time units are allowed
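
For readers who want a feel for the timecode rules in this list, here is a short Python sketch that rewrites a loose timecode into the strict `HH:MM:SS,mmm` form. It is not the code from the linked commit; the function name, the pattern, and the decisions to treat a missing fraction as zero milliseconds and to right-pad a short fraction are assumptions.

```python
import re

# Hypothetical loose pattern: one or two digits per unit, optional fraction.
_LOOSE_TIMECODE = re.compile(r"^(\d{1,2}):(\d{1,2}):(\d{1,2})(?:[,.](\d{1,3}))?$")


def normalize_timecode(text: str) -> str:
    """Rewrite a loose timecode as a strict HH:MM:SS,mmm string."""
    match = _LOOSE_TIMECODE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised timecode: {text!r}")
    hours, minutes, seconds, fraction = match.groups()
    # Assumption: no fraction means zero milliseconds, and a short
    # fraction is a decimal part, so "5" becomes "500".
    millis = (fraction or "0").ljust(3, "0")
    return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d},{millis}"
```

Normalizing first means the rest of a converter can keep working with strict timecodes only.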

Things I encountered, but didn't allow in my implementation:

  • Leading blank line in the subtitle text
  • Empty subtitle text
  • Missing subtitle index/order number

thomaspeeters avatar Dec 10 '15 16:12 thomaspeeters

@thomaspeeters I think it is a very good start. How about opening a pull request?

2ge avatar Mar 23 '16 10:03 2ge

I made a pull request. It's the first time I've done this, so I hope I did it right...

thomaspeeters avatar Mar 23 '16 13:03 thomaspeeters