Better handling of files with unknown character encoding

Open tkarabela opened this issue 3 years ago • 3 comments

As of 1.2.0, we default to UTF-8 encoding. If this is not correct, the user has to specify the proper encoding manually. To improve UX, we could try some autodetection before bailing out.

This is already something that users are dealing with, see:

  • #42
  • https://github.com/smacke/ffsubsync/blob/c5ba26620610f87ac30303fef5ca7c38e5dc2b3b/ffsubsync/subtitle_parser.py#L106

Consider adding https://github.com/chardet/chardet as (optional?) dependency.

(This is another idea from the original pysubs library.)

tkarabela avatar May 14 '21 20:05 tkarabela

Consider adding https://github.com/chardet/chardet as (optional?) dependency.

chardet and libmagic guess wrong too often. I have had a much better experience with charset_normalizer.

Context: I'm parsing millions of subtitles from opensubtitles.org, and many old subs have non-UTF-8 encodings.

something like...

diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
index 1202a46..ee22ea9 100644
--- a/pysubs2/ssafile.py
+++ b/pysubs2/ssafile.py
@@ -53,7 +53,7 @@ class SSAFile(MutableSequence):
     # ------------------------------------------------------------------------
 
     @classmethod
-    def load(cls, path: str, encoding: str="utf-8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
+    def load(cls, path: str, encoding: Optional[str]="utf8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
         """
         Load subtitle file from given path.
 
@@ -67,7 +67,8 @@ class SSAFile(MutableSequence):
         Arguments:
             path (str): Path to subtitle file.
             encoding (str): Character encoding of input file.
-                Defaults to UTF-8, you may need to change this.
+                Default is "utf8".
+                Set to None to autodetect the encoding.
             format_ (str): Optional, forces use of specific parser
                 (eg. `"srt"`, `"ass"`). Otherwise, format is detected
                 automatically from file contents. This argument should
@@ -100,6 +101,13 @@ class SSAFile(MutableSequence):
             >>> subs3 = pysubs2.load("subrip-subtitles-with-fancy-tags.srt", keep_unknown_html_tags=True)
 
         """
+        if encoding is None:
+            # guess encoding
+            import charset_normalizer
+            with open(path, "rb") as fp:
+                content_bytes = fp.read()
+            charset_matches = charset_normalizer.from_bytes(content_bytes)
+            encoding = str(charset_matches.best().encoding)
         with open(path, encoding=encoding) as fp:
             return cls.from_file(fp, format_, fps=fps, **kwargs)
 

edit

-            encoding = charset_matches.encoding
+            encoding = str(charset_matches.best().encoding)

also SSAFile.from_bytes is missing
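
A slightly more defensive variant of the detection step in the diff above, as a sketch only. `guess_encoding` is a hypothetical helper, not part of pysubs2. It assumes that strict UTF-8 is tried first, that `charset_normalizer` is an optional dependency, and that Latin-1 is an acceptable last resort. It also covers the case where `best()` returns `None`.

```python
def guess_encoding(content_bytes: bytes, fallback: str = "latin-1") -> str:
    """Return the name of an encoding that can decode content_bytes.

    Hypothetical helper for illustration; not part of pysubs2.
    """
    # Try strict UTF-8 first: it is cheap and fails on most non-UTF-8 input.
    try:
        content_bytes.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Optional dependency: skip detection if it is not installed.
    try:
        import charset_normalizer
    except ImportError:
        return fallback

    best = charset_normalizer.from_bytes(content_bytes).best()
    # best() returns None when no plausible match was found.
    return best.encoding if best is not None else fallback
```

With something like this, `load()` could call `guess_encoding(content_bytes)` when `encoding is None` and keep the rest of the method unchanged.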

milahu avatar Dec 17 '23 13:12 milahu

push

this is such an easy fix...

also SSAFile.from_bytes is missing

Related: for my app, it would also be useful to ignore the text encoding and parse the raw bytes, because one subtitle file can contain multiple text encodings, for example UTF-8 and Latin-1. In that case, no "guess encoding" library will help. See also https://github.com/Ousret/charset_normalizer/issues/405
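
To illustrate the mixed-encoding case, here is a minimal sketch that decodes each line on its own. `decode_lines` is a hypothetical helper, not a pysubs2 API, and it assumes that the encoding only changes at line boundaries and that Latin-1 is the right fallback, which will not hold for every file.

```python
from typing import List


def decode_lines(raw: bytes, fallback: str = "latin-1") -> List[str]:
    """Decode each line separately: UTF-8 if valid, otherwise the fallback.

    Hypothetical helper for illustration; not part of pysubs2.
    """
    lines = []
    for raw_line in raw.splitlines():
        try:
            lines.append(raw_line.decode("utf-8"))
        except UnicodeDecodeError:
            # This line is not valid UTF-8, so assume the fallback encoding.
            lines.append(raw_line.decode(fallback))
    return lines


# One UTF-8 line followed by one Latin-1 line in the same "file".
mixed = "café".encode("utf-8") + b"\n" + "naïve".encode("latin-1") + b"\n"
decoded = decode_lines(mixed)
```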

milahu avatar Mar 10 '24 12:03 milahu

@milahu I see your point, but I also don't like having str and non-str subtitle files... I think the answer is errors="surrogateescape", I will try to implement it for the next version of the library.
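
For reference, a minimal standard-library sketch of what `errors="surrogateescape"` does. This is not the planned pysubs2 implementation, and `decode_lossless` / `encode_lossless` are hypothetical names used only for illustration.

```python
def decode_lossless(raw: bytes) -> str:
    # Bytes that are not valid UTF-8 become lone surrogates (U+DC80..U+DCFF)
    # instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    # The same error handler turns those surrogates back into the original bytes.
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xc3\xa9\nna\xefve\n"  # a UTF-8 line followed by a Latin-1 line
text = decode_lossless(raw)       # str, with 0xEF kept as the surrogate U+DCEF
roundtrip = encode_lossless(text)
```

The result is still a regular `str`, so the existing parsers keep working, and writing it back with the same error handler reproduces the input bytes exactly. Encoding that string without the handler raises `UnicodeEncodeError`, since lone surrogates are not valid in UTF-8.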

tkarabela avatar May 05 '24 14:05 tkarabela