Feature: fix Korean mojibake

Open martinblech opened this issue 11 years ago • 2 comments

It would be great if ftfy could fix cases like this:

>>> s = u'¼Ò¸®¿¤ - »ç¶ûÇÏ´Â ÀÚ¿©'
>>> print(s.encode('latin1').decode('euc_kr'))
소리엘 - 사랑하는 자여

but it doesn't:

>>> import ftfy
>>> print(ftfy.fix_text_segment(s))
1⁄4Ò ̧®¿¤ - »ç¶ûÇÏ ́Â ÀÚ¿©

Source: http://media.yohan.net/7.html

Oct 02 '14 15:10 martinblech

Korean might actually be easier than the other cases, because it uses only one legacy encoding, and that encoding is multi-byte, so it should be possible to distinguish it from other encodings.
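
A rough sketch of that idea, in Python 3: re-encode the text as Latin-1, try a strict EUC-KR decode, and accept the result only if every non-ASCII character is a Hangul syllable. The helper name `try_fix_korean_mojibake` and the Hangul-only acceptance rule are assumptions made for illustration. This is not ftfy's API or its actual heuristic, and it would reject legitimate EUC-KR text containing Hanja or symbols.

```python
def try_fix_korean_mojibake(text):
    """Hypothetical helper (not part of ftfy): undo EUC-KR text that was
    mistakenly decoded as Latin-1. Returns the input unchanged when the
    repair does not look plausible."""
    try:
        # Recover the original bytes; fails if text has chars above U+00FF.
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        return text
    try:
        # EUC-KR is multi-byte, so most Latin-1 text is not a valid sequence.
        candidate = raw.decode('euc_kr')
    except UnicodeDecodeError:
        return text
    # Assumed acceptance rule: at least one non-ASCII character, and all of
    # them are precomposed Hangul syllables (U+AC00 to U+D7A3).
    non_ascii = [ch for ch in candidate if ord(ch) > 0x7F]
    if non_ascii and all('\uac00' <= ch <= '\ud7a3' for ch in non_ascii):
        return candidate
    return text
```

With this sketch, the string from the report above should come back as the Korean title, while ordinary Latin-1 text such as `'café'` is returned unchanged because a lone `0xE9` byte is not a complete EUC-KR sequence.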

Merged into #18.

Oct 02 '14 17:10 rspeer

I've updated #18 to be more specifically about single-byte encodings, which means that this is its own issue again.

Jul 10 '18 19:07 rspeer