
rtf_to_text() converts cp1252 RTF with Russian text incorrectly

Open svladimirs opened this issue 1 year ago • 12 comments

striprtf 0.0.26

With the headers {\rtf1\ansi\ansicpg1251 or {\rtf1\adeflang1025\ansi\ansicpg1251, rtf_to_text() converts cp1251 RTFs (Russian text) correctly.

But cp1252 RTFs, with the header {\rtf1\adeflang1025\ansi\ansicpg1252, are converted incorrectly: абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò

Passing encoding=... does not help.

This helps: https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943 or rtf_to_text(rtf.read()).encode('cp1252').decode('ansi'). Attached: test-rus.zip

svladimirs avatar Dec 18 '23 07:12 svladimirs
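
The mojibake reported above can be reproduced with Python's built-in codecs alone, without striprtf. This is a minimal sketch, assuming the bytes in the file are really cp1251 while the header declares cp1252; the variable names are illustrative only.

```python
# Russian sample text from the report.
original = "абвгдеёжзийклмнопрст"

# Bytes as they would be stored if the text is really cp1251.
raw = original.encode("cp1251")

# Decoding those bytes with the declared (assumed wrong) code page
# produces the accented Latin characters seen in the report.
garbled = raw.decode("cp1252")

# Undoing the wrong decode and decoding again as cp1251 recovers the text.
recovered = garbled.encode("cp1252").decode("cp1251")

print(garbled)
print(recovered)
```

The round trip only works because every byte in this sample happens to be defined in both code pages.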

Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251 and not in cp1252. If I change the RTF content to cp1251 it works fine. cp1252 is the Western encoding.

joshy avatar Dec 18 '23 16:12 joshy

MS Word 2016 saves with 1251 (test-2016.zip), but newer versions released after 2016, such as MS Word 2021, save as 1252 (test-rus).

svladimirs avatar Dec 19 '23 01:12 svladimirs

If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.

stevengj avatar Dec 19 '23 16:12 stevengj

I created a small test case myself with Word 365, and indeed it saves the file with encoding 1252. I have no idea how Word finds out the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly, and WordPad shows it correctly too. The question is how do they figure out the right encoding?

joshy avatar Dec 20 '23 08:12 joshy

Thanks. I did it like this:

decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass

svladimirs avatar Dec 20 '23 09:12 svladimirs

@svladimirs: Glad you got a workaround. When I run your code I get: LookupError: unknown encoding: ansi. How can this run?

joshy avatar Dec 20 '23 10:12 joshy

The question is how do they figure out the right encoding?

Maybe they do charset detection?

stevengj avatar Dec 20 '23 13:12 stevengj

The question is how do they figure out the right encoding?

Maybe they do charset detection?

I tried the chardet library and it told me with nearly 80% confidence that the encoding is ISO-8859-8 which is Hebrew. What I tried:

import chardet
y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
chardet.detect(y)
>>>{'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}

joshy avatar Dec 20 '23 14:12 joshy
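
A much cruder alternative to general charset detection, assuming the only two candidates are cp1252 and cp1251, is to re-decode the text and check whether the result is mostly Cyrillic. The sketch below is only an illustration of that idea; `cyrillic_ratio`, `guess_redecode`, and the 0.8 threshold are hypothetical and are not part of striprtf or chardet, and nothing here claims to be how Word or the online viewers work.

```python
def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if "\u0400" <= ch <= "\u04FF")
    return cyrillic / len(letters)


def guess_redecode(text: str, threshold: float = 0.8) -> str:
    """Return the cp1251 re-decoding only if it looks mostly Cyrillic."""
    try:
        candidate = text.encode("cp1252").decode("cp1251")
    except UnicodeError:
        # Not representable in cp1252, or not valid cp1251: leave unchanged.
        return text
    return candidate if cyrillic_ratio(candidate) >= threshold else text


print(guess_redecode("àáâãäå¸æçèéêëìíîïðñò"))
print(guess_redecode("café"))
```

A heuristic like this can still misfire, for example on short Western strings that consist almost entirely of accented letters.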

@joshy, you're probably using Python < 3.6. See 7.2.4.1. Text Encodings: https://docs.python.org/3.5/library/codecs.html https://docs.python.org/3.6/library/codecs.html Let's replace ansi with mbcs.

The author of the answer at that Stack Overflow link used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code. I replaced 'iso-8859-1' with 'cp1252' because the signature is def rtf_to_text(text, encoding="cp1252", errors="strict").

https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170 "For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS". I thought ansi (mbcs) would be more versatile than cp1251.

What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work? Maybe the problem with 1251/1252 will go away?

svladimirs avatar Dec 21 '23 02:12 svladimirs

@svladimirs As you can see from the screenshot, I am using Python 3.9.

Regarding your proposals:

  • mbcs is Windows-only, so not really an option
  • the encoding argument in def rtf_to_text(text, encoding="mbcs", errors="strict") is only a fallback. If the file itself declares an encoding, as files from newer versions of Word do, the encoding declared in the RTF file is used

joshy avatar Dec 26 '23 21:12 joshy
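
Whether a codec name such as 'ansi' or 'mbcs' resolves depends on the platform and Python version, which would be consistent with the LookupError reported earlier in the thread. A small check using the standard codecs module can make this visible; `codec_available` is a hypothetical helper name used only for illustration.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if this Python build can resolve the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# cp1251 and cp1252 ship with every CPython build; 'mbcs' and its
# 'ansi' alias are documented as Windows-only.
for name in ("cp1251", "cp1252", "mbcs", "ansi"):
    print(name, codec_available(name))
```

On a non-Windows machine the last two names are expected to be unavailable, so code that relies on them is not portable.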

Well, mbcs won't work either... Then:

decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass

svladimirs avatar Dec 29 '23 03:12 svladimirs
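
The workaround above could be wrapped in a small user-side helper that catches only encoding errors instead of using a bare except. This is a sketch, assuming the file is already known to be cp1251 text mislabeled as cp1252; `redecode_cp1252_as_cp1251` is a hypothetical name and not part of striprtf.

```python
def redecode_cp1252_as_cp1251(text: str) -> str:
    """Undo a cp1252 decode and decode the same bytes as cp1251.

    Returns the input unchanged when the round trip is not possible.
    """
    try:
        return text.encode("cp1252").decode("cp1251")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


print(redecode_cp1252_as_cp1251("àáâãä"))
```

It would be applied to the string returned by rtf_to_text(), and only for files known to be affected.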

As a library I can't do that, but you as a user can. The reason is that the library has to treat the encoding specified in the RTF file as correct; overriding it would convert correctly encoded files to a wrong encoding.

joshy avatar Jan 09 '24 08:01 joshy
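
The concern above can be illustrated with text that really is cp1252. In this sketch (the sample string is made up for illustration), applying the cp1252-to-cp1251 re-decoding unconditionally raises no error and silently replaces the accented Latin letters with Cyrillic ones.

```python
# A genuinely Western text that is correctly labeled as cp1252.
western = "Crème brûlée à la carte"

# Re-decoding it as cp1251 succeeds without any exception...
damaged = western.encode("cp1252").decode("cp1251")

# ...but the accented letters are now Cyrillic letters.
print(damaged)
```

Because no exception is raised, a try/except around the conversion cannot tell a mislabeled file from a correct one.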