safaribooks
Wrong order of chapters
The order of chapters is messed up.
- About The Author
- Chapter 5
- “About the Cover Illustration”
- Chapter 1, 2, 3, 4
- Chapter 6 <..>
Is this an issue only with this specific title, Marko Lukša's "Kubernetes in Action"?
I tried to run the following command twice with the same outcome both times:
python3 safaribooks.py --cred "xxx:yyy" --kindle 9781617293726
I can confirm that it's not only this book; it happens with other books I've tested. The chapters are indeed out of order. The problem is in the creation of the content.opf file. Its contents are shown below: Chapter 05 (right after <itemref idref="Author"/>) doesn't belong on that line, and <itemref idref="resources"/>, between Chapters 16 and 17, is also in the wrong place.
<spine toc="ncx">
<itemref idref="titlepage"/>
<itemref idref="titl"/>
<itemref idref="Copyright"/>
<itemref idref="Dedication"/>
<itemref idref="btoc"/>
<itemref idref="toc"/>
<itemref idref="Preface"/>
<itemref idref="Acknowledgments"/>
<itemref idref="Book"/>
<itemref idref="Author"/>
<itemref idref="05"/> <!-- Chapter 5 is placed in the wrong order -->
<itemref idref="Cover"/>
<itemref idref="p1"/>
<itemref idref="01"/>
<itemref idref="02"/>
<itemref idref="p2"/>
<itemref idref="03"/>
<itemref idref="04"/>
<itemref idref="06"/>
<itemref idref="07"/>
<itemref idref="08"/>
<itemref idref="09"/>
<itemref idref="10"/>
<itemref idref="p3"/>
<itemref idref="11"/>
<itemref idref="12"/>
<itemref idref="13"/>
<itemref idref="14"/>
<itemref idref="15"/>
<itemref idref="16"/>
<itemref idref="resources"/> <!-- Resources is placed in the wrong order -->
<itemref idref="17"/>
<itemref idref="18"/>
<itemref idref="A"/>
<itemref idref="B"/>
<itemref idref="C"/>
<itemref idref="D"/>
<itemref idref="Index"/>
<itemref idref="Figures"/>
<itemref idref="Tables"/>
<itemref idref="Listings"/>
</spine>
Either the Python script parses the chapters in the wrong order as it appends them, or some later rearrangement causes the issue.
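To check other books quickly, the spine can be read back from content.opf and scanned for numeric chapter ids that are out of sequence. This is only a rough sketch: the helper names spine_idrefs and misplaced_numeric_ids are made up for illustration, the sample is a shortened version of the spine above, and the check only understands purely numeric ids such as "05" (not ids like "ch03").

```python
import xml.etree.ElementTree as ET


def spine_idrefs(opf_text):
    """Return the idref of every <itemref> in document order, ignoring namespaces."""
    root = ET.fromstring(opf_text)
    return [
        el.get("idref")
        for el in root.iter()
        if isinstance(el.tag, str) and el.tag.rsplit("}", 1)[-1] == "itemref"
    ]


def misplaced_numeric_ids(idrefs):
    """Flag numeric ids whose next numeric neighbour in the spine is smaller."""
    numeric = [i for i in idrefs if i.isdigit()]
    return [a for a, b in zip(numeric, numeric[1:]) if int(a) > int(b)]


# Shortened, hypothetical sample modelled on the spine shown above.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf">
  <spine toc="ncx">
    <itemref idref="Author"/>
    <itemref idref="05"/>
    <itemref idref="Cover"/>
    <itemref idref="01"/>
    <itemref idref="02"/>
    <itemref idref="06"/>
  </spine>
</package>"""

ids = spine_idrefs(SAMPLE)
print(ids)
print(misplaced_numeric_ids(ids))
```

For a real book, the text of the generated content.opf would be passed in instead of SAMPLE.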
My Environment:
I'm running the latest commit (e016ad3) as of this post.
- OS: Ubuntu 21.10 x86_64
- Kernel: 5.13.0-20-generic
- Shell: bash 5.1.8
- Node: v12.22.5
- npm: v8.1.1
- Python3: v3.9.7
I can confirm that this happens quite a lot for me too, in fact with basically every book I download. One book you can test is Strategic Monoliths and Microservices: Driving Innovation Using Purposeful Architecture.
The content.opf generated for me is the following. Notice, for example, that chapter three comes right after the preface:
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml" />
<item id="cover" href="cover.xhtml" media-type="application/xhtml+xml" />
<item id="pref00" href="pref00.xhtml" media-type="application/xhtml+xml" />
<item id="praise" href="praise.xhtml" media-type="application/xhtml+xml" />
<item id="halftitle" href="halftitle.xhtml" media-type="application/xhtml+xml" />
<item id="fm01" href="fm01.xhtml" media-type="application/xhtml+xml" />
<item id="title" href="title.xhtml" media-type="application/xhtml+xml" />
<item id="copyright" href="copyright.xhtml" media-type="application/xhtml+xml" />
<item id="contents" href="contents.xhtml" media-type="application/xhtml+xml" />
<item id="foreword" href="foreword.xhtml" media-type="application/xhtml+xml" />
<item id="preface" href="preface.xhtml" media-type="application/xhtml+xml" />
<item id="ch03" href="ch03.xhtml" media-type="application/xhtml+xml" />
<item id="acknowledgments" href="acknowledgments.xhtml" media-type="application/xhtml+xml" />
<item id="authors" href="authors.xhtml" media-type="application/xhtml+xml" />
<item id="part01" href="part01.xhtml" media-type="application/xhtml+xml" />
<item id="ch01" href="ch01.xhtml" media-type="application/xhtml+xml" />
<item id="ch02" href="ch02.xhtml" media-type="application/xhtml+xml" />
<item id="part02" href="part02.xhtml" media-type="application/xhtml+xml" />
<item id="ch04" href="ch04.xhtml" media-type="application/xhtml+xml" />
<item id="ch05" href="ch05.xhtml" media-type="application/xhtml+xml" />
<item id="ch06" href="ch06.xhtml" media-type="application/xhtml+xml" />
<item id="ch07" href="ch07.xhtml" media-type="application/xhtml+xml" />
<item id="part03" href="part03.xhtml" media-type="application/xhtml+xml" />
<item id="ch08" href="ch08.xhtml" media-type="application/xhtml+xml" />
<item id="ch09" href="ch09.xhtml" media-type="application/xhtml+xml" />
<item id="part04" href="part04.xhtml" media-type="application/xhtml+xml" />
<item id="ch10" href="ch10.xhtml" media-type="application/xhtml+xml" />
<item id="ch11" href="ch11.xhtml" media-type="application/xhtml+xml" />
<item id="ch12" href="ch12.xhtml" media-type="application/xhtml+xml" />
<item id="index" href="index.xhtml" media-type="application/xhtml+xml" />
I'm sure someone with better understanding of python and the code base can explain this better than I can, but the problem looks like it's related to line #567 in safaribooks.py.
Problem
When taking a closer look at the original code causing the problem:
result.extend([c for c in response["results"] if "cover" in c["filename"] or "cover" in c["title"]])
The code above appends an item to the "result" list if the value of the dictionary's "filename" or "title" key contains the word "cover". I'm fairly certain that the chapters or sections that are incorrectly ordered contain the word "cover" in their title or filename. The problem is that we can't use the in operator here, because cover also appears inside words like dis𝘤𝘰𝘷𝘦𝘳y.
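A small illustration of the substring problem, using made-up chapter records (the real API response has more fields, and the titles here are only examples):

```python
# Hypothetical chapter records, for illustration only.
chapters = [
    {"title": "Cover", "filename": "cover.xhtml"},
    {"title": "Service discovery", "filename": "ch05.xhtml"},
    {"title": "Chapter 1", "filename": "ch01.xhtml"},
]

# Same condition as the original line: a substring test with `in`.
substring_hits = [
    c["filename"]
    for c in chapters
    if "cover" in c["filename"] or "cover" in c["title"]
]

# Exact comparison on the lowercased title, for contrast.
exact_hits = [c["filename"] for c in chapters if c["title"].lower() == "cover"]

print(substring_hits)
print(exact_hits)
```

The substring test also picks up ch05.xhtml because "discovery" contains "cover", so that chapter gets pulled forward together with the real cover page.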
Workaround?
result.extend([c for c in response["results"] if "cover" == c["title"].lower() or "cover.xhtml" == c["filename"].lower() or "titlepage.xhtml" == c["filename"].lower()])
I've changed the in operator to ==, which does an exact match of the string, and I've added titlepage.xhtml because some books, such as API Security in Action (9781617296024) and The Art of Network Penetration Testing (9781617296826), don't contain a cover.xhtml file; in those books it's called titlepage.xhtml instead.
That sorted out the wrong order of chapters, but now there's an issue with the books mentioned above: the cover image from titlepage.xhtml isn't downloaded when using the workaround code above.