epub2txt.py produces incorrect results for many epubs

Open: shawwn opened this issue 4 years ago • 1 comment

Specifically this line: https://github.com/soskek/bookcorpus/blob/05a3f227d9748c2ee7ccaf93819d0e0236b6f424/epub2txt.py#L149

When I used this script to convert a book on TensorFlow to text, I noticed that chapter 1 was repeated multiple times.

The reason is that the Table of Contents looks similar to this:

ch1.html#section1
ch1.html#section2
ch1.html#section3
...
ch2.html#section1
ch2.html#section2
...

The epub2txt script iterates over this table of contents, strips the fragment from "ch1.html#section1" to get "ch1.html", and converts that file to text. It then does the same for "ch1.html#section2", which converts the same chapter to text a second time, and so on for every section entry.
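To illustrate the idea, here is a minimal sketch of one way to avoid the repetition: drop the `#fragment` from each table-of-contents entry and convert each underlying file only once, in first-seen order. This is not the code from epub2txt.py or from the fixed version linked below; the function name `unique_documents` and the sample `toc` list are made up for this example.

```python
from urllib.parse import urldefrag


def unique_documents(toc_hrefs):
    """Return each document path once, in first-seen order, ignoring #fragments.

    Hypothetical helper for illustration; not part of epub2txt.py.
    """
    seen = set()
    ordered = []
    for href in toc_hrefs:
        # urldefrag("ch1.html#section1") yields ("ch1.html", "section1")
        path, _fragment = urldefrag(href)
        if path and path not in seen:
            seen.add(path)
            ordered.append(path)
    return ordered


# Sample table of contents shaped like the one described above (assumed data).
toc = [
    "ch1.html#section1",
    "ch1.html#section2",
    "ch1.html#section3",
    "ch2.html#section1",
    "ch2.html#section2",
]

for path in unique_documents(toc):
    print(path)  # each chapter file appears once, so it would be converted once
```

Keeping a `seen` set alongside an ordered list preserves the reading order of the book while skipping files that were already converted.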

I have a fixed version here: https://github.com/shawwn/scrap/blob/afb699ee9c8181b3728b81fc410a31b66311f0d8/epub2txt#L158-L206

shawwn, Sep 01 '20 21:09

Thank you! I'll fix it!

soskek, Sep 05 '20 07:09