pysubs2
Better handling of files with unknown character encoding
As of 1.2.0, we default to UTF-8 encoding. If this is not correct, the user has to specify the proper encoding manually. To improve UX, we could try some autodetection before bailing out.
This is already something that users are dealing with, see:
- #42
- https://github.com/smacke/ffsubsync/blob/c5ba26620610f87ac30303fef5ca7c38e5dc2b3b/ffsubsync/subtitle_parser.py#L106
Consider adding https://github.com/chardet/chardet as (optional?) dependency.
(This is another idea from the original pysubs library.)
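A minimal sketch of what "autodetection before bailing out" could look like, assuming chardet stays an optional dependency. The helper name `guess_encoding` is hypothetical and not part of pysubs2; the only third-party call used is `chardet.detect`.

```python
def guess_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a codec name for data, preferring the library default.

    Hypothetical helper, not a pysubs2 API.
    """
    try:
        # Strict decode: succeeds only if data is valid in the default codec.
        data.decode(default)
        return default
    except UnicodeDecodeError as exc:
        original_error = exc

    try:
        import chardet  # optional dependency, only needed on this path
    except ImportError:
        # No detector available: bail out with the original error.
        raise original_error from None

    guess = chardet.detect(data)["encoding"]
    if guess is None:
        # The detector could not decide either, so bail out as before.
        raise original_error
    return guess
```

Valid UTF-8 input never reaches the detector, so existing behaviour is unchanged for files that already load today.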
Consider adding https://github.com/chardet/chardet as (optional?) dependency.
chardet and libmagic guess wrong too often. I have had a much better experience with charset_normalizer.
Context: I'm parsing millions of subtitles from opensubtitles.org, and many old subs have non-UTF-8 encodings.
Something like:
diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
index 1202a46..ee22ea9 100644
--- a/pysubs2/ssafile.py
+++ b/pysubs2/ssafile.py
@@ -53,7 +53,7 @@ class SSAFile(MutableSequence):
     # ------------------------------------------------------------------------

     @classmethod
-    def load(cls, path: str, encoding: str="utf-8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
+    def load(cls, path: str, encoding: Optional[str]="utf8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
         """
         Load subtitle file from given path.

@@ -67,7 +67,8 @@ class SSAFile(MutableSequence):
         Arguments:
             path (str): Path to subtitle file.
             encoding (str): Character encoding of input file.
-                Defaults to UTF-8, you may need to change this.
+                Default is "utf8".
+                Set to None to autodetect the encoding.
             format_ (str): Optional, forces use of specific parser
                 (eg. `"srt"`, `"ass"`). Otherwise, format is detected
                 automatically from file contents. This argument should
@@ -100,6 +101,13 @@ class SSAFile(MutableSequence):
             >>> subs3 = pysubs2.load("subrip-subtitles-with-fancy-tags.srt", keep_unknown_html_tags=True)

         """
+        if encoding is None:
+            # guess encoding
+            import charset_normalizer
+            with open(path, "rb") as fp:
+                content_bytes = fp.read()
+            charset_matches = charset_normalizer.from_bytes(content_bytes)
+            encoding = str(charset_matches.best().encoding)
         with open(path, encoding=encoding) as fp:
             return cls.from_file(fp, format_, fps=fps, **kwargs)
edit
- encoding = charset_matches.encoding
+ encoding = str(charset_matches.best().encoding)
also SSAFile.from_bytes is missing
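Until something like that exists, a rough workaround is to decode the bytes yourself and hand the text to `SSAFile.from_string`. The sketch below is only an illustration: `decode_subtitle_bytes` and `ssafile_from_bytes` are hypothetical names, and the `fallback` parameter is an assumption added to cover the case where `best()` returns None because no candidate encoding fits.

```python
from typing import Optional


def decode_subtitle_bytes(data: bytes, encoding: Optional[str] = None,
                          fallback: str = "utf-8") -> str:
    """Decode raw subtitle bytes; autodetect when encoding is None."""
    if encoding is None:
        # Imported lazily so the dependency is only needed for autodetection.
        import charset_normalizer

        best = charset_normalizer.from_bytes(data).best()
        # best() returns None when no candidate encoding fits the data.
        encoding = best.encoding if best is not None else fallback
    return data.decode(encoding)


def ssafile_from_bytes(data: bytes, encoding: Optional[str] = None, **kwargs):
    """Hypothetical stand-in for SSAFile.from_bytes (not a pysubs2 API)."""
    import pysubs2

    text = decode_subtitle_bytes(data, encoding)
    return pysubs2.SSAFile.from_string(text, **kwargs)
```

Passing an explicit `encoding` skips detection entirely, so charset_normalizer is only needed for the autodetect path.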
push
this is such an easy fix...
also SSAFile.from_bytes is missing
Related: for my app, it would also be useful to ignore the text encoding and parse the raw bytes, because one subtitle file can contain multiple text encodings (for example UTF-8 and Latin-1). In that case, no "guess encoding" library will help. See also https://github.com/Ousret/charset_normalizer/issues/405
@milahu I see your point, but I also don't like having str and non-str subtitle files... I think the answer is errors="surrogateescape"; I will try to implement it for the next version of the library.
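For reference, a small standard-library illustration of why `surrogateescape` fits the mixed-encoding case: undecodable bytes become lone surrogates (U+DC80 to U+DCFF) in the str, and encoding with the same handler restores the original bytes. The `repair_line` helper is hypothetical and assumes each line uses a single encoding, which will not hold for every file.

```python
def has_escaped_bytes(text: str) -> bool:
    """True if text contains bytes smuggled in by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def repair_line(line: str, fallback: str = "latin-1") -> str:
    """Re-decode a line that was not valid UTF-8 using a fallback codec.

    Hypothetical helper; assumes the whole line is in one encoding.
    """
    if not has_escaped_bytes(line):
        return line
    # Recover the original bytes, then decode them with the fallback codec.
    raw = line.encode("utf-8", errors="surrogateescape")
    return raw.decode(fallback)


# One UTF-8 line followed by one Latin-1 line in the same "file".
mixed = "první řádek\n".encode("utf-8") + "café\n".encode("latin-1")

# Decoding never fails; the stray Latin-1 byte 0xE9 becomes U+DCE9.
text = mixed.decode("utf-8", errors="surrogateescape")

# The str round-trips losslessly back to the original bytes.
assert text.encode("utf-8", errors="surrogateescape") == mixed

repaired = [repair_line(line) for line in text.splitlines()]
```

This keeps a single str-based subtitle type while still letting callers get at the raw bytes when a line needs special handling.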