python-opensubtitles icon indicating copy to clipboard operation
python-opensubtitles copied to clipboard

latin chars discarded

Open shei-pi opened this issue 4 years ago • 6 comments

The detect function can't always guess the right encoding, so the decompress methods falls back to 'utf-8' and ignoring the errors. Therefore the latin chars get discarded. There should be an optional param to force the decoding to be used when decoding the data.

shei-pi avatar May 13 '20 02:05 shei-pi

It seems a bit silly to always fall back to utf-8 when no character detection libraries are installed. data actually contains the encoding specified when subtitles are uploaded (even though this can be wrong if the author set it wrong) with the key "SubEncoding" so I feel like that should really be the first choice to fall back on.

CosmicHorrorDev avatar Jun 01 '20 03:06 CosmicHorrorDev

That's a really good point @LovecraftianHorror, I didn't realise we had that info on the response.

@jprabino could you change your PR #44 for testing this?

Thanks to both!

agonzalezro avatar Jun 02 '20 15:06 agonzalezro

Got it. Thanks @LovecraftianHorror!

shei-pi avatar Jun 02 '20 19:06 shei-pi

@LovecraftianHorror can you give an example? I've been trying to test this but couldn't find where the "SubEncoding" key is supposed to be. We get self.data using: self.data = self.xmlrpc.DownloadSubtitles(self.token, ids)

Here's the result of that.

`self.data

'status':'200 OK' 'data':[{'data': 'H4sIAAAAAAAAA5V9zY4c...zzOeZotQAA', 'idsubtitlefile': '1955339152'}] 'seconds':0.06 len():3`

'data' inside self.data has the subtitles in base64, by the time we convert from base64 we should have the encoding already. In any case I checked and couldn't find the "SubEncoding" anywhere.

shei-pi avatar Jun 17 '20 02:06 shei-pi

Oh sorry I should have looked into the library more to see how downloading actually worked. The "SubEncoding" is returned on the results from SearchSubtitles not on the download information. You might be able to add another parameter for passing in encodings for the subtitle ids with a dictionary to make them optional? I'm not really sure about the best way to do this then.

If you want another reference there's me getting the "SubEncoding" here for a search result and using it here for downloaded subtitles. I recently changed downloading to just directly download the content instead of decoding it which avoids the error entirely. It seems like you should be able to do that here except for when return_decoded_data is True which would force you to decode it. Maybe changing to that would be the best course of option since I doubt many people use return_decoded_data, but it would still be good to have a way to specify the decoding.

Hopefully, that gives some options. I should have looked into the code a bit more before commenting the first time.

CosmicHorrorDev avatar Jun 17 '20 02:06 CosmicHorrorDev

Thanks for the insights. 👍

tiago-portugal avatar Aug 29 '22 22:08 tiago-portugal