ebooklib
get_items() for ebooklib.ITEM_DOCUMENT as ordered?
I have some EPUBs that each contain several books, and each book has its own chapter sequence (Book 1 -> Chapter 1 ...; Book 2 -> Chapter 1 ...). The items returned by get_items() seem to be out of order. Would it be possible to have the ITEM_DOCUMENT items ordered by page number?
Doing it outside EpubBook would be quite cumbersome and error-prone (checking chapter listings etc.), e.g. for a sanity check:
Original Sequence captured:
[1, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9]
There 2 independent sets of chapter sequences:
{1: [0, 4], 2: [5, 11], 3: [6, 15], 4: [7, 16], 5: [8, 17], 6: [9, 18], 7: [10, 19], 8: [12, 20], 9: [13, 21], 10: [1, 14], 11: [2], 12: [3]}
It would be great to have these kinds of books ordered when extracting chapters.
I get chapters via:
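import ebooklib
from ebooklib import epub

def epub2thtml(epub_path):
    # Collect the content of every document item, in the order get_items() yields them
    book = epub.read_epub(epub_path)
    chapters_unordered = []
    for item in book.get_items():
        if item.get_type() == ebooklib.ITEM_DOCUMENT:
            chapters_unordered.append(item.get_content())
    return chapters_unordered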
The sequences above come from a sanity check (assumption: the first 30 characters of each chapter contain the string 'Chapter' and a number), performed via:
# Assumes each element is already plain text (str), not the raw bytes from get_content()
chapters = chapters_unordered
chapter_numbers = []
chapter_numbers_dict = {}
ordering_issue = False
for el in chapters:
    # Look for exactly one integer token next to the word "Chapter" in the first 30 characters
    chapter_number = [x for x in el[0:30].split(" ") if RepresentsInt(x) and "Chapter" in el[0:30].split(" ")]
    if len(chapter_number) == 1:
        chapter_numbers.append(int(chapter_number[0]))
    else:
        if not ordering_issue:
            print("Chapter-less ordering for:")
        print("\t{}".format(" ".join(el[0:30].split(" ")[0:-1])))
        ordering_issue = True
print("Original Sequence captured: \n\t{}".format(chapter_numbers))
set_of_ch_nums = sorted(list(set(chapter_numbers)))
set_of_indep_chapter_seqs = 0
len_list = []
# For every chapter number, record the positions at which it occurs
for el in set_of_ch_nums:
    seq = [i for i, x in enumerate(chapter_numbers) if x == el]
    chapter_numbers_dict[el] = seq
    len_list.append(len(seq))
    if len(seq) > set_of_indep_chapter_seqs:
        set_of_indep_chapter_seqs = len(seq)
if set_of_indep_chapter_seqs > 1:
    print("There {} independent sets of chapter sequences: \n\t{}".format(set_of_indep_chapter_seqs, chapter_numbers_dict))
    for ind, x in enumerate(len_list):
        if ind >= len(len_list) - 1:
            break
        if len_list[ind] < len_list[ind+1]:
            print("\t-> Missing chapter for a sequence at {}".format(ind + 1))
def RepresentsInt(s):
    # True if s can be parsed as an integer
    try:
        int(s)
        return True
    except ValueError:
        return False
Of course, the above code is just for inspection, not for actually fixing the ordering. That would be error-prone, as the numbering is not always consistent and sometimes does not appear in the first lines at all. For example, if chapter 6 in book 2 has no chapter number (it might sit on an extra page that is treated as its own chapter), it would be missing. Hence my question.
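For reference, the position grouping from the sanity check can be reproduced in a few lines on the captured sequence. This is only a minimal sketch; group_positions is a hypothetical helper name, and the input is the sequence quoted earlier in this post.

```python
from collections import defaultdict

def group_positions(chapter_numbers):
    """Map each chapter number to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, number in enumerate(chapter_numbers):
        positions[number].append(index)
    # Sort by chapter number so the result is easy to read
    return dict(sorted(positions.items()))

# The sequence captured from the EPUB in question
captured = [1, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9]
positions = group_positions(captured)

# The longest position list gives the number of interleaved chapter sequences
independent_sequences = max(len(p) for p in positions.values())
print(positions)
print(independent_sequences)
```

A chapter number that occurs fewer times than its successor (here none, since 11 and 12 trail off at the end) would hint at a chapter whose heading was not detected.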
One could use get_pages_for_items() for this, right?
Method get_items() returns items in the order the files were specified in the content.opf file under
The problem is that you could have some items which are not mentioned in the .toc or .spine, but that is not that rare.
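If reading order is what matters, one option is to walk the spine rather than rely on the order of get_items(). The following is a minimal sketch, not part of ebooklib: documents_in_spine_order and read_chapters_in_spine_order are hypothetical helper names, and the sketch assumes that after read_epub() each entry in book.spine is an (idref, linear) tuple that get_item_with_id() can resolve. Documents that are not listed in the spine are appended at the end so that nothing is silently dropped.

```python
def documents_in_spine_order(book, document_type):
    """Return document items in spine order, followed by documents missing from the spine."""
    ordered = []
    seen = set()
    for entry in book.spine:
        # Assumption: entries are (idref, linear) tuples; bare ids or item objects are tolerated too
        item_id = entry[0] if isinstance(entry, (tuple, list)) else entry
        item = book.get_item_with_id(item_id) if isinstance(item_id, str) else item_id
        if item is None or item.get_type() != document_type:
            continue
        if item.get_id() in seen:
            continue
        seen.add(item.get_id())
        ordered.append(item)
    # Items that exist in the book but are not referenced by the spine
    leftovers = [
        item for item in book.get_items()
        if item.get_type() == document_type and item.get_id() not in seen
    ]
    return ordered + leftovers

def read_chapters_in_spine_order(epub_path):
    """Hypothetical wrapper: read an EPUB and return document contents in spine order."""
    import ebooklib
    from ebooklib import epub
    book = epub.read_epub(epub_path)
    return [item.get_content() for item in documents_in_spine_order(book, ebooklib.ITEM_DOCUMENT)]
```

Whether spine order matches the intended Book 1 / Book 2 chapter order still depends on how the EPUB was produced, so a sanity check like the one in the question remains useful.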