Subtitles validation
I'm trying to validate subtitles with this:
    import codecs

    import pysrt
    from charade.universaldetector import UniversalDetector


    def is_valid_subtitle(path):
        u = UniversalDetector()
        for line in open(path, 'rb'):
            u.feed(line)
        u.close()
        encoding = u.result['encoding']
        source_file = codecs.open(path, 'rU', encoding=encoding, errors='replace')
        try:
            for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
                pass
        except pysrt.Error:
            return False
        except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
            pass
        return True
Unfortunately, for some subtitles it fails even though the file is a valid subtitle. For example, this one: https://docs.google.com/open?id=0B2q9iBGZdj6qOXZrbFpiV2ozOHc
I think there should be different kinds of InvalidItem error. It could be subclassed to raise, in this case, an EmptyText error.
That said, I'm not sure this should raise an error at all, because an empty text doesn't mean the item is invalid; its text is simply empty.
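To illustrate the subclassing idea, here is a minimal sketch. The classes below are local stand-ins defined only for this example (they are not pysrt's own classes), and `EmptyText`, `check_item` and `is_acceptable` are hypothetical names. The point is that a caller could treat an empty text differently from a real structural error:

```python
class Error(Exception):
    """Stand-in for a library-wide base error (local to this sketch)."""


class InvalidItem(Error):
    """An item whose structure cannot be parsed."""


class EmptyText(InvalidItem):
    """Hypothetical subclass: the item parsed fine but has no text."""


def check_item(index, timing, text):
    # Deliberately naive checks, only meant to trigger the two error types.
    if "-->" not in timing:
        raise InvalidItem("item %s: bad timing line" % index)
    if not text.strip():
        raise EmptyText("item %s: empty text" % index)


def is_acceptable(index, timing, text):
    try:
        check_item(index, timing, text)
    except EmptyText:
        # Must be caught before InvalidItem, since it is a subclass of it.
        return True
    except InvalidItem:
        return False
    return True
```

Code that catches `InvalidItem` without special-casing `EmptyText` would still see an empty text as an error, so existing callers would not change behavior.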
A convenience method in pysrt to check for valid subtitle files would be welcome.
I'm not sure if pysrt should consider an empty text as an error.
All I can say right now is that it's not intended behavior; in fact, I never thought about that possibility.
I think it's reasonable to consider them valid, unless you know of some players that fail to parse them?
I'm also OK with implementing a kind of pysrt.validate(path, encoding=None), but I'm not sure what the best behavior is:
- Should I just return a boolean, or the error list?
- If I fail to parse because of an encoding error, should I raise or consider the file invalid?
Your feedback on this one is welcome.
My personal preference is pysrt.is_valid(path, ignore_encoding_errors=True), which checks for subtitle file errors. When there are encoding issues, most readers will just display unreadable characters but will read the file anyway, hence ignore_encoding_errors.
I think it's important to dissociate structure validation from encoding issues.
I wouldn't use the error list but I agree that could be useful for some.
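To make the two proposals concrete, here is a rough standard-library sketch of how a boolean `is_valid` could sit on top of a `validate` that returns the error list. This is not pysrt's API or parser: `validate`, `is_valid` and `TIMING_RE` are hypothetical names, the structure check is a simplified regex, and the UTF-8 default is an assumption. Empty text is treated as valid, and a decoding failure with `ignore_encoding_errors=False` is reported as "invalid", which is only one possible answer to the question above.

```python
import io
import re

# Simplified timing line: "HH:MM:SS,mmm --> HH:MM:SS,mmm".
# Anything after the second timestamp (e.g. coordinates) is ignored.
TIMING_RE = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s+-->\s+\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)


def validate(path, encoding="utf-8", ignore_encoding_errors=True):
    """Return a list of (block_number, message) structural errors.

    Undecodable bytes are replaced when ignore_encoding_errors is True;
    otherwise decoding is strict and UnicodeDecodeError may be raised.
    """
    mode = "replace" if ignore_encoding_errors else "strict"
    with io.open(path, "r", encoding=encoding, errors=mode) as source:
        content = source.read()
    content = content.lstrip(u"\ufeff").strip()  # drop BOM and outer blanks

    errors = []
    if not content:
        return errors  # an empty file has no items to reject
    # Items are separated by one or more blank lines.
    for number, block in enumerate(re.split(r"\n\s*\n", content), 1):
        lines = block.splitlines()
        if not lines[0].strip().isdigit():
            errors.append((number, "missing or invalid index"))
        elif len(lines) < 2 or not TIMING_RE.match(lines[1].strip()):
            errors.append((number, "missing or invalid timing line"))
        # No check on the text lines: an empty text is considered valid.
    return errors


def is_valid(path, encoding="utf-8", ignore_encoding_errors=True):
    """Boolean wrapper around validate()."""
    try:
        return not validate(path, encoding, ignore_encoding_errors)
    except UnicodeDecodeError:
        # Only reachable in strict mode; treated here as an invalid file.
        return False
```

Keeping the error list in `validate` and the boolean in `is_valid` would serve both use cases without forcing one return type on everyone.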