
get_items() for ebooklib.ITEM_DOCUMENT as ordered?

Open chrisoutwright opened this issue 3 years ago • 2 comments

I have some EPUBs that each contain several books, and each book has its own chapter sequence (Book 1 -> Chapter 1 ...; Book 2 -> Chapter 1 ...). When I iterate with get_items(), the sequence seems to be out of order. Would it be possible to have the ITEM_DOCUMENT items ordered by page number?

Doing it outside EpubBook would be quite cumbersome and error-prone (checking the chapter listing, etc.). For example, a sanity check gives:

Original Sequence captured: 
	[1, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 3, 4, 5, 6, 7, 8, 9]
There 2 independent sets of chapter sequences: 
	{1: [0, 4], 2: [5, 11], 3: [6, 15], 4: [7, 16], 5: [8, 17], 6: [9, 18], 7: [10, 19], 8: [12, 20], 9: [13, 21], 10: [1, 14], 11: [2], 12: [3]}

It would be great to have these kinds of books ordered when extracting chapters.

I get chapters via:

    import ebooklib
    from ebooklib import epub

    def epub2thtml(epub_path):
        # Collect the content of every document item, in get_items() order.
        book = epub.read_epub(epub_path)
        chapters_unordered = []
        for item in book.get_items():
            if item.get_type() == ebooklib.ITEM_DOCUMENT:
                chapters_unordered.append(item.get_content())
        return chapters_unordered

The sequences above come from a sanity check (assumption: the first 30 characters of each chapter contain the string 'Chapter' and a number), performed via:


    def RepresentsInt(s):
        # True if s parses as an integer; defined first so the loop below can call it.
        try:
            int(s)
            return True
        except ValueError:
            return False

    # Assumes each chapter has already been converted to a text string.
    chapters = chapters_unordered
    chapter_numbers = []
    chapter_numbers_dict = {}
    ordering_issue = False
    for el in chapters:
        words = el[0:30].split(" ")
        chapter_number = [x for x in words if RepresentsInt(x) and "Chapter" in words]
        if len(chapter_number) == 1:
            chapter_numbers.append(int(chapter_number[0]))
        else:
            # No unambiguous "Chapter <n>" in the first 30 characters.
            if not ordering_issue:
                print("Chapter-less ordering for:")
            print("\t{}".format(" ".join(words[0:-1])))
            ordering_issue = True

    print("Original Sequence captured: \n\t{}".format(chapter_numbers))
    set_of_ch_nums = sorted(set(chapter_numbers))
    set_of_indep_chapter_seqs = 0
    len_list = []
    for el in set_of_ch_nums:
        # Positions in the captured sequence where chapter number `el` occurs.
        seq = [i for i, x in enumerate(chapter_numbers) if x == el]
        chapter_numbers_dict[el] = seq
        len_list.append(len(seq))
        if len(seq) > set_of_indep_chapter_seqs:
            set_of_indep_chapter_seqs = len(seq)

    if set_of_indep_chapter_seqs > 1:
        print("There {} independent sets of chapter sequences: \n\t{}".format(set_of_indep_chapter_seqs, chapter_numbers_dict))
        for ind, x in enumerate(len_list):
            if ind >= len(len_list) - 1:
                break
            if len_list[ind] < len_list[ind + 1]:
                print("\t-> Missing chapter for a sequence at {}".format(ind + 1))

Of course, the code above is only for inspection, not for actually fixing the ordering. That approach would be error-prone, since the numbering is not always consistent and may be absent from the first lines. For example, if chapter 6 of book 2 has no chapter number (it might sit on an extra page that is treated as its own chapter), it would be reported as missing. Hence my question.

chrisoutwright, Dec 30 '20 02:12

One could use get_pages_for_items() for this, right?

chrisoutwright, Dec 30 '20 02:12

The get_items() method returns items in the order in which the files are listed in the manifest of the content.opf file (inside the EPUB file). If you want the chapters in the order the authors intended, you should definitely use book.toc or book.spine.
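
As a rough sketch of the spine-based approach (not taken from the ebooklib documentation): after read_epub(), book.spine appears to hold (idref, linear) tuples, so the document items can be looked up by id and returned in that order. The helper names order_by_spine and chapters_in_reading_order are made up for illustration, and the code assumes the spine entries are either plain idrefs or tuples whose first element is the idref.

```python
def order_by_spine(items, spine):
    """Return the items referenced by the spine, in spine order.

    items: iterable of objects exposing get_id(), as ebooklib items do.
    spine: list of idrefs or (idref, linear) tuples (assumed shape of book.spine).
    """
    by_id = {item.get_id(): item for item in items}
    ordered = []
    for entry in spine:
        # Accept both "idref" and ("idref", "yes") style entries.
        idref = entry[0] if isinstance(entry, (tuple, list)) else entry
        if idref in by_id:
            ordered.append(by_id[idref])
    return ordered


def chapters_in_reading_order(epub_path):
    """Hypothetical replacement for a get_items() loop over ITEM_DOCUMENT."""
    # Imported lazily so order_by_spine() can be tried without ebooklib installed.
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(epub_path)
    documents = book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
    return [item.get_content() for item in order_by_spine(documents, book.spine)]
```

Calling chapters_in_reading_order("some_book.epub") (the path is a placeholder) should then give the chapter contents in spine order rather than manifest order. Items that the spine does not reference are silently skipped in this sketch.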

The problem is that you could have some items which are not mentioned in the .toc or .spine, but that is not that rare.
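
To see whether a given book has such unreferenced items, one option is to split the document items into those the spine mentions and the leftovers. This is only a sketch under the same assumption about the spine's shape (plain idrefs or tuples starting with the idref); split_by_spine is a hypothetical helper, not part of ebooklib.

```python
def split_by_spine(items, spine):
    """Split items into (in spine order, not referenced by the spine).

    items: iterable of objects exposing get_id(), as ebooklib items do.
    spine: list of idrefs or (idref, linear) tuples (assumed shape of book.spine).
    """
    items = list(items)  # allow generators such as get_items()
    position = {}
    for index, entry in enumerate(spine):
        idref = entry[0] if isinstance(entry, (tuple, list)) else entry
        # Keep the first position if an idref is listed more than once.
        position.setdefault(idref, index)

    in_spine = sorted(
        (item for item in items if item.get_id() in position),
        key=lambda item: position[item.get_id()],
    )
    leftovers = [item for item in items if item.get_id() not in position]
    return in_spine, leftovers
```

The leftovers keep their original (manifest) order, so they can be inspected or appended after the spine-ordered items, depending on what the caller wants.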

aerkalov, Feb 08 '21 00:02