python-ftfy
Feature: fix Korean mojibake
It would be great if ftfy could fix cases like this:
>>> s = u'¼Ò¸®¿¤ - »ç¶ûÇÏ´Â ÀÚ¿©'
>>> print s.encode('latin1').decode('euc_kr')
소리엘 - 사랑하는 자여
but it doesn't:
>>> print ftfy.fix_text_segment(s)
1⁄4Ò ̧®¿¤ - »ç¶ûÇÏ ́ ÀÚ¿©
Source: http://media.yohan.net/7.html
Korean might actually be easier than the other cases: Korean text mostly uses a single legacy encoding (EUC-KR), and that encoding is multi-byte, so it should be possible to distinguish it from other encodings.
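To illustrate why the multi-byte structure helps, here is a rough sketch of a detector (Python 3). It is not ftfy's implementation: the function name `try_fix_euc_kr_mojibake` and the 0.8 threshold are made up for this example, and it assumes the text was mis-decoded as Latin-1 specifically. It accepts the EUC-KR reading only if the bytes decode cleanly and the non-ASCII result is almost entirely Hangul syllables.

```python
def try_fix_euc_kr_mojibake(text, min_hangul_ratio=0.8):
    """Return the EUC-KR reading of Latin-1 mojibake, or `text` unchanged.

    Hypothetical helper for illustration; the threshold is an arbitrary guess.
    """
    try:
        # Undo the wrong Latin-1 decoding, then decode the bytes as EUC-KR.
        candidate = text.encode("latin-1").decode("euc_kr")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes
        # are not valid EUC-KR (e.g. a lead byte with no trail byte).
        return text

    non_ascii = [ch for ch in candidate if ord(ch) > 127]
    if not non_ascii:
        # Pure ASCII: nothing to fix.
        return text

    # Count precomposed Hangul syllables (U+AC00 to U+D7A3).
    hangul = sum(1 for ch in non_ascii if "\uac00" <= ch <= "\ud7a3")
    if hangul / len(non_ascii) >= min_hangul_ratio:
        return candidate
    return text


s = "¼Ò¸®¿¤ - »ç¶ûÇÏ´Â ÀÚ¿©"
print(try_fix_euc_kr_mojibake(s))
```

Ordinary Latin-1 text such as "café" tends to fail the EUC-KR decode outright, because an accented letter followed by ASCII is not a valid two-byte sequence. That is the property that makes this case easier to tell apart than single-byte encodings.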
Merged into #18.
I've updated #18 to be more specifically about single-byte encodings, which means that this is its own issue again.