
Error in parsing NAV document containing <a> without href attribute

Open wendong-li opened this issue 6 years ago • 0 comments

This can be reproduced on v0.17.1.

When parsing the NAV document, the current implementation assumes that every a element has an href attribute.

def _parse_nav(self, data, base_path, navtype='toc'):
    ...

    def parse_list(list_node):
        items = []

        for item_node in list_node.findall('li'):
            ...
            link_node = item_node.find('a')

            if sublist_node is not None:
                ...
                if link_node is not None:
                    href = zip_path.normpath(zip_path.join(base_path, link_node.get('href')))
                    ...
            elif link_node is not None:
                title = link_node.text
                href = zip_path.normpath(zip_path.join(base_path, link_node.get('href')))

            ...

When the attribute is missing, link_node.get('href') returns None, and zip_path.join throws the exception 'NoneType' object has no attribute 'startswith'.
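
A minimal sketch of the failure, assuming zip_path behaves like the standard posixpath module and using xml.etree.ElementTree in place of the library's own parser. The base path 'EPUB' is made up for illustration. The exception type appears to depend on the Python version: the message above matches Python 2, while Python 3 raises a TypeError for the same call.

```python
import posixpath as zip_path  # assumption: stands in for the library's zip_path
import xml.etree.ElementTree as ET

# A NAV list item whose link was stripped, as in the preview EPUBs.
item_node = ET.fromstring('<li><a>Chapter 29</a></li>')
link_node = item_node.find('a')

# The attribute is absent, so get() returns None rather than a string.
href_attr = link_node.get('href')

try:
    zip_path.normpath(zip_path.join('EPUB', href_attr))
    join_failed = False
except (AttributeError, TypeError):
    # AttributeError on Python 2, TypeError on Python 3.
    join_failed = True

print(href_attr, join_failed)
```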

I guess this assumption holds in most cases, but I ran into some EPUB files where it does not. Those files are preview versions of full editions: they keep the whole TOC section but remove some of the links inside it, which leaves a elements without href, e.g.

<a>Chapter 29</a>

According to the W3C HTML5 draft, this seems to be allowed: https://www.w3.org/TR/2011/WD-html5-20110525/text-level-semantics.html#the-a-element

I guess it's not a common use case, but it would be nice if it could be handled.
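
One possible way to handle it, as a rough sketch only: resolve_href is a hypothetical helper, not part of ebooklib, and it again assumes zip_path behaves like posixpath. The paths and element contents are invented for illustration. It returns None when the a element has no href instead of passing None to zip_path.join.

```python
import posixpath as zip_path  # assumption: stands in for the library's zip_path
import xml.etree.ElementTree as ET


def resolve_href(base_path, link_node):
    """Return the normalized href of link_node, or None if it has no href.

    Hypothetical helper for illustration; not part of ebooklib.
    """
    href = link_node.get('href')
    if href is None:
        # <a> without href: nothing to resolve, so skip the join entirely.
        return None
    return zip_path.normpath(zip_path.join(base_path, href))


linked = ET.fromstring('<a href="../text/ch01.xhtml">Chapter 1</a>')
unlinked = ET.fromstring('<a>Chapter 29</a>')

resolved = resolve_href('EPUB/nav', linked)
missing = resolve_href('EPUB/nav', unlinked)
print(resolved, missing)
```

What the caller does with a None result (keep the title as an unlinked entry, or drop the entry) is a separate design choice that this sketch leaves open.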

wendong-li, Jan 22 '19 07:01