words are missing or out of order

Open trzhong opened this issue 4 years ago • 9 comments

I read an EPUB in Chinese using epr on macOS 10.15.4 with Python 3.7, and epr displayed this:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一眉立目的那么一款,没想到看上去很温婉。样子的时候,会觉得您是穿着警服有点横

And the content displayed in ibooks is:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的是第一次见到您,但是在我和傅见锋[2] 做的节目当中,我们好像都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一位女士!原来他们做点好采访,我没见到您样子的时候, 会觉得您是穿着警服有点横 眉立目的那么一款,没想到看上去很温婉。~~会觉得您是穿着警服有点横~~

This is not limited to this paragraph or this book; many others have the same problem.

trzhong avatar May 12 '20 06:05 trzhong

This is crucial. epr originally only supported English, but I will try a Chinese EPUB and have a look when I'm free.

wustho avatar May 12 '20 09:05 wustho

Hey there. I just tried looking into it, and it seems to be beyond my capability, sorry. I hope someone else can make a PR for this issue. It probably has something to do with the HTMLtoLines(HTMLParser) class, if anyone cares to help fix it.

wustho avatar May 12 '20 23:05 wustho

Since "textwrap.wrap()" cannot handle Chinese characters properly, I tried adding the code below to "HTMLtoLines.get_lines":

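            else:
                # textwrap counts characters, but a CJK character fills two
                # terminal columns, so scale the wrap width down for CJK text.
                w = width
                l = len(i)  # number of characters
                cjk_l = len(i.encode(encoding='UTF-8'))  # number of UTF-8 bytes
                # Estimate of the single-byte (ASCII) character count; CJK
                # characters are assumed to take 3 bytes each.
                asc_l = int((l * 3 - cjk_l) / 3)
                if cjk_l > l:  # at least one multi-byte character
                    # l * 2 - asc_l approximates the width in columns
                    w = int(w * l / (l * 2 - asc_l))
                text += textwrap.wrap(i, w) + [""]
        return text, self.imgs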

Although this does display the content correctly, I don't think it is the best solution. I would prefer a better wrapping library.

trzhong avatar May 15 '20 15:05 trzhong

Wow, that's impressive troubleshooting... After I read your comment, I did some googling, and found this: https://bugs.python.org/issue24665

Indeed, as you said, textwrap.wrap() cannot handle Chinese characters properly. It also seems that the issue about CJK support in textwrap was closed as rejected, apparently after some confusion in the discussion. So I don't think we will get support for non-Latin scripts in textwrap any time soon. For now I will list this as a limitation in the README while we wait for a better wrapping library, as you suggested.
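
A small sketch of the mismatch, assuming a terminal that draws East Asian wide characters in two columns. The helper display_width is hypothetical (it is not part of epr or textwrap); it relies on the standard library's unicodedata.east_asian_width to count columns.

```python
import textwrap
import unicodedata


def display_width(s: str) -> int:
    """Hypothetical helper: terminal columns used, counting wide/fullwidth characters as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s)


sample = "今天终于是见着真人了" * 3  # 30 CJK characters, no spaces
lines = textwrap.wrap(sample, 20)

for line in lines:
    # textwrap limits len(line) to 20 characters, but a line of 20 CJK
    # characters needs 40 columns on screen.
    print(len(line), display_width(line), line)
```

On a terminal narrower than the real display width, the overflow wraps again or is cut off, which may be one way the text ends up looking missing or out of order.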

wustho avatar May 15 '20 22:05 wustho

@trzhong Hey there, you might want to try https://github.com/aeosynth/bk as an alternative.

wustho avatar Jul 12 '20 11:07 wustho

I added support for wide characters to bk. There may be other issues; for example, I don't know the line-breaking rules for Asian text.

1Q84 by Murakami rendered at 30 columns:

1q84

aeosynth avatar Jul 17 '20 06:07 aeosynth

I'm still using my patch. Thanks for the information.

trzhong avatar Sep 27 '20 15:09 trzhong

Finally, I found [rich] as a replacement for [textwrap].

Add "from rich import cells", then replace all [textwrap.wrap] calls with [cells.chop_cells].

That's all.
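
To illustrate the idea of splitting by screen columns instead of characters, here is a standard-library sketch. It is not rich's implementation: cell_len and chop_by_cells are hypothetical names, and the sketch assumes East Asian wide and fullwidth characters take two columns and everything else takes one.

```python
import unicodedata
from typing import List


def cell_len(ch: str) -> int:
    """Hypothetical helper: columns one character occupies (2 for wide/fullwidth)."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def chop_by_cells(text: str, width: int) -> List[str]:
    """Split text into pieces that each fit within `width` terminal columns."""
    lines: List[str] = []
    current, used = "", 0
    for ch in text:
        w = cell_len(ch)
        # Start a new piece when this character would overflow the current one.
        if current and used + w > width:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


print(chop_by_cells("今天终于是见着真人了", 8))
print(chop_by_cells("ab今天c", 4))
```

Unlike textwrap.wrap, this sketch ignores word boundaries, so English text would be cut mid-word; it only shows the column accounting.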

trzhong avatar Jan 17 '21 15:01 trzhong

Wow, https://github.com/willmcgugan/rich seems really powerful and feature-rich. Thanks for pointing that out, mate. I'll try to implement it in epy.

wustho avatar Jan 18 '21 00:01 wustho