striprtf
rtf_to_text() converts Russian text in cp1252 RTF files incorrectly
striprtf 0.0.26
rtf_to_text() converts RTFs declared as cp1251 correctly (Russian text), for example files starting with {\rtf1\ansi\ansicpg1251 or {\rtf1\adeflang1025\ansi\ansicpg1251.
But not files declared as cp1252, starting with {\rtf1\adeflang1025\ansi\ansicpg1252: абвгдеёжзийклмнопрст -> àáâãäå¸æçèéêëìíîïðñò
Passing encoding=... does not help.
This helps: https://ru.stackoverflow.com/questions/1145225/Ошибка-обработки-файлов-rtf-на-python?ysclid=lqagyqz7x5798462943 or rtf_to_text(rtf.read()).encode('cp1252').decode('ansi'). Test file: test-rus.zip
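The garbled output and the round-trip workaround can be reproduced with the standard codecs alone. This is only an illustration of the mechanism, assuming the Cyrillic bytes are cp1251 and get decoded as cp1252 because of the header; it does not involve striprtf itself.

```python
# Illustration only: reproduce the mojibake from the report with standard codecs.
russian = "абвгдеёжзийклмнопрст"

# The Cyrillic characters correspond to these single bytes in cp1251 ...
raw = russian.encode("cp1251")

# ... but the header declares \ansicpg1252, so they are decoded as cp1252.
garbled = raw.decode("cp1252")
print(garbled)  # àáâãäå¸æçèéêëìíîïðñò

# Reversing the wrong decode recovers the original text.
restored = garbled.encode("cp1252").decode("cp1251")
print(restored == russian)  # True
```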
Hi, according to Wikipedia, Cyrillic RTF should be encoded in cp1251 and not in cp1252. If I change the RTF content to cp1251, it works fine. cp1252 is the Western encoding.
MS Word 2016 (test-2016.zip) saves with 1251, but newer versions of MS Word released after 2016 (2021 or earlier) save as 1252 (test-rus).
If a file (whether it's RTF or any other encoding) lists the wrong encoding, you are going to get mojibake … I don't think there's anything striprtf can realistically do about buggy RTF files.
I have created a small test case myself with Word 365, and indeed it saves the file with encoding 1252. I have no idea how Word finds out the right encoding in this case. Some online RTF viewers (https://products.groupdocs.app/de/viewer/rtf, https://jumpshare.com/viewer/rtf) are also able to display the content correctly. WordPad shows it correctly too. The question is how do they figure out the right encoding?
Thanks. I did it like this:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('ansi')
except:
    pass
@svladimirs: Glad you got a workaround. When I run your code, I get: LookupError: unknown encoding: ansi. How can this run?
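A LookupError like this means the running interpreter has no codec registered under that name. A small check along these lines shows which names resolve on a given machine; codec_available is a hypothetical helper for illustration, not part of striprtf. The Python documentation lists mbcs (and its alias ansi) as Windows-only, so the last two results are expected to differ between platforms.

```python
import codecs

def codec_available(name: str) -> bool:
    """Return True if this Python installation can resolve the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

# Results for "mbcs" and "ansi" depend on the platform and Python version.
for name in ("cp1251", "cp1252", "mbcs", "ansi"):
    print(name, codec_available(name))
```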
The question is how do they figure out the right encoding?
Maybe they do charset detection?
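Another possibility, which nothing in this thread confirms, is that such viewers look at the \fcharsetN control words in the RTF font table, where 204 denotes the Russian character set. The sketch below only scans for those control words; the mapping is partial, and font_codecs is a hypothetical helper, not something striprtf provides.

```python
import re

# Partial, illustrative mapping from RTF \fcharsetN values to Python codecs.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western)
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Russian (Cyrillic)
    238: "cp1250",  # Eastern European
}

def font_codecs(rtf: str) -> set:
    """Collect codecs implied by \\fcharsetN control words in the RTF source."""
    values = {int(n) for n in re.findall(r"\\fcharset(\d+)", rtf)}
    return {FCHARSET_TO_CODEC[v] for v in values if v in FCHARSET_TO_CODEC}

# Made-up header: declares cp1252 globally but uses a Cyrillic font.
sample = r"{\rtf1\ansi\ansicpg1252{\fonttbl{\f0\fswiss\fcharset204 Arial;}}}"
print(font_codecs(sample))  # {'cp1251'}
```

A real reader would also have to track which font is active for each run of text, which this sketch does not attempt.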
I tried the chardet library, and it told me with nearly 80% confidence that the encoding is ISO-8859-8, which is Hebrew. What I tried:
>>> import chardet
>>> y = 'àáâãäå¸æçèéêëìíîïðñò\n'.encode('cp1252')
>>> chardet.detect(y)
{'encoding': 'ISO-8859-8', 'confidence': 0.7950708952163513, 'language': 'Hebrew'}
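Part of why a detector can land on Hebrew is that these bytes are valid in several single-byte code pages at once. The sketch below uses only the standard codecs, on bytes built the same way as in the experiment, and shows that cp1252, cp1251 and ISO-8859-8 all decode them without error, so a detector has to rely on statistics rather than validity.

```python
# Same kind of input as in the chardet experiment.
data = "àáâãäå¸æçèéêëìíîïðñò\n".encode("cp1252")

# Each of these single-byte codecs accepts the bytes without error,
# so validity alone cannot tell a detector which one was intended.
candidates = {
    name: data.decode(name) for name in ("cp1252", "cp1251", "iso8859_8")
}

for name, text in candidates.items():
    print(f"{name}: {text!r}")
```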
@joshy, you're probably using Python < 3.6. See section 7.2.4.1, Text Encodings: https://docs.python.org/3.5/library/codecs.html https://docs.python.org/3.6/library/codecs.html Let's replace ansi -> mbcs.
The author of that Stack Overflow answer used .encode('iso-8859-1').decode('cp1251'), but I tried to write universal code. I replaced 'iso-8859-1' with 'cp1252' because the signature is def rtf_to_text(text, encoding="cp1252", errors="strict").
https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170 "For platforms used in markets whose languages use large character sets, the best alternative to Unicode is MBCS". I thought ansi (mbcs) would be more versatile than cp1251.
What do you think about: def rtf_to_text(text, encoding="mbcs", errors="strict"). Will it work? Maybe the problem with 1251/1252 will go away?
@svladimirs As you can see, I am using Python 3.9.
Regarding your proposals:
- mbcs is Windows-only, so not really an option
- The encoding in def rtf_to_text(text, encoding="mbcs", errors="strict") is only a fallback. If the file itself declares an encoding, as files from newer versions of Word do, the encoding declared in the RTF file is used
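The precedence described here, where a declared code page wins over the default argument, could be sketched as follows. This is not striprtf's actual implementation; declared_codec is a hypothetical helper that only looks for the \ansicpgN control word.

```python
import re

def declared_codec(rtf: str, default: str = "cp1252") -> str:
    """Return the codec named by \\ansicpgN in the RTF source, else the default."""
    match = re.search(r"\\ansicpg(\d+)", rtf)
    return f"cp{match.group(1)}" if match else default

print(declared_codec(r"{\rtf1\ansi\ansicpg1251 text}"))  # cp1251
print(declared_codec(r"{\rtf1\ansi text}"))              # cp1252
```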
Well, mbcs won't work either... Then:
decoded = rtf_to_text(rtf)
try:
    decoded = decoded.encode('cp1252').decode('cp1251')
except:
    pass
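The same workaround can be wrapped in a small function that catches only encoding errors instead of every exception. This is a user-side sketch; undo_cp1252_as_cp1251 is a made-up name, and the function assumes the input really is cp1251 text that was decoded as cp1252.

```python
def undo_cp1252_as_cp1251(text: str) -> str:
    """Re-interpret cp1252-decoded text as cp1251; return it unchanged on failure."""
    try:
        return text.encode("cp1252").decode("cp1251")
    except UnicodeError:
        # Characters outside cp1252, or bytes undefined in cp1251: leave as is.
        return text

print(undo_cp1252_as_cp1251("àáâãä"))  # абвгд
print(undo_cp1252_as_cp1251("абвгд"))  # абвгд (already Cyrillic, encode fails)
print(undo_cp1252_as_cp1251("café"))   # cafй (genuine Western text gets damaged)
```

The last call shows the catch: text that really was cp1252 is silently changed, so this is only safe when you know the file contains Russian.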
As a library, I can't do that; you as a user can. The reason is that the library has to assume the encoding specified in the RTF file is correct, and for files where it is, this step would convert the text to a wrong encoding.