Extract Navigation (Table of Contents)
It would be nice for this library to extract the navigational elements from the EPUB file. It will probably need to handle both EPUB 3 and EPUB 2 to cover most EPUB files out there.
I've created a gist https://gist.github.com/aymanosman/960a16a9cb5324a474fb541fe0feacbb that demonstrates what that might look like.
The current parser already extracts the navigation. For example, if I parse Elixir.epub I get the following:
iex(1)> BUPE.parse "Elixir.epub"
%BUPE.Config{
title: "Elixir - 1.18.3",
creator: nil,
contributor: nil,
date: nil,
identifier: "urn:uuid:4f88c473-7742-7960-977e-8651832447a5",
unique_identifier: "project-Elixir",
source: nil,
type: nil,
modified: "2025-03-06T10:06:03Z",
description: nil,
format: nil,
coverage: nil,
publisher: nil,
relation: nil,
rights: nil,
subject: nil,
logo: nil,
language: "en",
version: "3.0",
pages: [
%BUPE.Item{
duration: nil,
fallback: nil,
href: "nav.xhtml",
id: "nav",
media_overlay: nil,
media_type: "application/xhtml+xml",
description: nil,
properties: "nav scripted",
content: "<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\"\n xmlns:epub=\"http://www.idpf.org/2007/ops\">\n <head>\n <meta charset=\"utf-8\" />\n <title>Table Of Contents - Elixir v1.18.3</title>\n <meta name=\"generator\" content=\"ExDoc v0.37.2\" />\n <link type=\"text/css\" rel=\"stylesheet\" href=\"dist/epub-elixir-FNUUKFP7.css\" />\n <script src=\"dist/epub-4WIP524F.js\"></script>\n\n </head>\n <body class=\"content-inner\">\n\n <h1>Table of contents</h1>\n <nav epub:type=\"toc\">\n <ol>\n\n\n\n <li><a href=\"changelog.xhtml\">Changelog for Elixir v1.18</a></li>\n\n\n\n\n <li><span>Getting started</span>\n <ol>\n\n\n <li><a href=\"introduction.xhtml\">Introduction</a></li>\n\n <li><a href=\"basic-types.xhtml\">Basic types</a></li>\n\n <li><a href=\"lists-and-tuples.xhtml\">Lists and tuples</a></li>\n\n <li><a href=\"pattern-matching.xhtml\">Pattern matching</a></li>\n\n <li><a href=\"case-cond-and-if.xhtml\">case, cond, and if</a></li>\n\n <li><a href=\"anonymous-functions.xhtml\">Anonymous functions</a></li>\n\n <li><a href=\"binaries-strings-and-charlists.xhtml\">Binaries, strings, and charlists</a></li>\n\n <li><a href=\"keywords-and-maps.xhtml\">Keyword lists and maps</a></li>\n\n <li><a href=\"modules-and-functions.xhtml\">Modules and functions</a></li>\n\n <li><a href=\"recursion.xhtml\">Recursion</a></li>\n\n <li><a href=\"enumerable-and-streams.xhtml\">Enumerables and Streams</a></li>\n\n <li><a href=\"processes.xhtml\">Processes</a></li>\n\n <li><a href=\"io-and-the-file-system.xhtml\">IO and the file system</a></li>\n\n <li><a href=\"alias-require-and-import.xhtml\">alias, require, import, and use</a></li>\n\n <li><a href=\"module-attributes.xhtml\">Module attributes</a></li>\n\n <li><a href=\"structs.xhtml\">Structs</a></li>\n\n <li><a href=\"protocols.xhtml\">Protocols</a></li>\n\n <li><a href=\"comprehensions.xhtml\">Comprehensions</a></li>\n\n <li><a href=\"sigils.xhtml\">Sigils</a></li>\n\n <li><a 
href=\"try-catch-and-rescue.xhtml\">try, catch, and rescue</a></li>\n\n <li><a href=\"writing-documentation.xhtml\">Writing documentation</a></li>\n\n <li><a href=\"optional-syntax.xhtml\">Optional syntax sheet</a></li>\n\n <li><a href=\"erlang-libraries.xhtml\">Erlang libraries</a></li>\n\n <li><a href=\"debugging.xhtml\">Debugging</a></li>\n\n\n </ol>\n </li>\n\n\n\n <li><span>Cheatsheets</span>\n <ol>\n\n\n <li><a href=\"enum-cheat.xhtml\">Enum cheatsheet</a></li>\n\n\n </ol>\n </li>\n\n\n\n <li><span>Anti-patterns</span>\n <ol>\n\n\n <li><a href=\"what-anti-patterns.xhtml\">What are anti-patterns?</a></li>\n\n <li><a href=\"code-anti-patterns.xhtml\">Code-related anti-patterns</a></li>\n\n <li><a href=\"design-anti-patterns.xhtml\">Design-related anti-patterns</a></li>\n\n <li><a href=\"process-anti-patterns.xhtml\">Process-related anti-patterns</a></li>\n\n <li><a href=\"macro-anti-patterns.xhtml\">Meta-programming anti-patterns</a></li>\n\n\n </ol>\n </li>\n\n\n\n <li><span>Meta-programming</span>\n <ol>\n\n\n <li><a href=\"quote-and-unquote.xhtml\">Quote and unquote</a></li>\n\n <li><a href=\"macros.xhtml\">Macros</a></li>\n\n <li><a href=\"domain-specific-languages.xhtml\">Domain-Specific Languages (DSLs)</a></li>\n\n\n </ol>\n </li>\n\n\n\n <li><span>Mix & OTP</span>\n <ol>\n\n\n <li><a href=\"introduction-to-mix.xhtml\">Introduction to Mix</a></li>\n\n <li><a href=\"agents.xhtml\">Simple state management with agents</a></li>\n\n <li><a href=\"genservers.xhtml\">Client-server communication with GenServer</a></li>\n\n <li><a href=\"supervisor-and-application.xhtml\">Supervision trees and applications</a></li>\n\n <li><a href=\"dynamic-supervisor.xhtml\">Supervising dynamic children</a></li>\n\n <li><a href=\"erlang-term-storage.xhtml\">Speeding up with ETS</a></li>\n\n <li><a href=\"dependencies-and-umbrella-projects.xhtml\">Dependencies and umbrella projects</a></li>\n\n <li><a href=\"task-and-gen-tcp.xhtml\">Task and gen_tcp</a></li>\n\n <li><a 
href=\"docs-tests-and-with.xhtml\">Doctests, patterns, and wit" <> ...
},
%BUPE.Item{duration: nil, fallback: nil, href: "debugging.xhtml", ...},
%BUPE.Item{duration: nil, fallback: nil, ...},
%BUPE.Item{duration: nil, ...},
%BUPE.Item{...},
...
],
nav: [
%{idref: "cover"},
%{idref: "nav"},
%{idref: "changelog"},
%{idref: "introduction"},
%{idref: "basic-types"},
%{idref: "lists-and-tuples"},
%{idref: "pattern-matching"},
%{idref: "case-cond-and-if"},
%{idref: "anonymous-functions"},
%{idref: "binaries-strings-and-charlists"},
%{idref: "keywords-and-maps"},
%{idref: "modules-and-functions"},
%{idref: "recursion"},
%{idref: "enumerable-and-streams"},
%{idref: "processes"},
%{idref: "io-and-the-file-system"},
%{idref: "alias-require-and-import"},
%{idref: "module-attributes"},
%{idref: "structs"},
%{idref: "protocols"},
%{idref: "comprehensions"},
%{idref: "sigils"},
%{idref: "try-catch-and-rescue"},
%{idref: "writing-documentation"},
%{idref: "optional-syntax"},
%{idref: "erlang-libraries"},
%{idref: "debugging"},
%{idref: "enum-cheat"},
%{...},
...
],
styles: [
%BUPE.Item{
duration: nil,
fallback: nil,
href: "dist/epub-elixir-FNUUKFP7.css",
id: "epub-elixir-fnuukfp7-css",
media_overlay: nil,
media_type: "text/css",
description: nil,
properties: nil,
content: ":root{--main: hsl(250, 68%, 69%);--mainDark: hsl(250, 68%, 59%);--mainDarkest: hsl(250, 68%, 49%);--mainLight: hsl(250, 68%, 74%);--mainLightest: hsl(250, 68%, 79%);--searchBarFocusColor: #8E7CE6;--searchBarBorderColor: rgba(142, 124, 230, .25);--link-color: var(--mainDark);--link-visited-color: var(--mainDarkest)}body.dark{--link-color: var(--mainLightest);--link-visited-color: var(--mainLight)}:root{--content-width: 949px;--content-gutter: 60px;--borderRadius-lg: 14px;--borderRadius-base: 8px;--borderRadius-sm: 3px;--navTabBorderWidth: 2px;--sansFontFamily: \"Lato\", system-ui, Segoe UI, Roboto, Helvetica, Arial, sans-serif, \"Apple Color Emoji\", \"Segoe UI Emoji\";--monoFontFamily: ui-monospace, SFMono-Regular, Consolas, Liberation Mono, Menlo, monospace;--baseLineHeight: 1.5em;--gray25: hsl(207, 43%, 98%);--gray50: hsl(207, 43%, 96%);--gray100: hsl(212, 33%, 91%);--gray200: hsl(210, 29%, 88%);--gray300: hsl(210, 26%, 84%);--gray400: hsl(210, 21%, 64%);--gray450: hsl(210, 21%, 49%);--gray500: hsl(210, 21%, 34%);--gray600: hsl(210, 27%, 26%);--gray700: hsl(212, 35%, 17%);--gray750: hsl(214, 46%, 14%);--gray800: hsl(216, 52%, 11%);--gray800-opacity-0: hsla(216, 52%, 11%, 0%);--gray850: hsl(216, 63%, 8%);--gray900: hsl(218, 73%, 4%);--gray900-opacity-50: hsla(218, 73%, 4%, 50%);--gray900-opacity-0: hsla(218, 73%, 4%, 0%);--coldGrayFaint: hsl(240, 5%, 97%);--coldGrayLight: hsl(240, 5%, 88%);--coldGray-lightened-10: hsl(240, 5%, 56%);--coldGray: hsl(240, 5%, 46%);--coldGray-opacity-10: hsla(240, 5%, 46%, 10%);--coldGrayDark: hsl(240, 5%, 28%);--coldGrayDim: hsl(240, 5%, 18%);--yellowLight: hsl(43, 100%, 95%);--yellowDark: hsl(44, 100%, 15%);--yellow: hsl(60, 100%, 43%);--green-lightened-10: hsl(90, 100%, 45%);--green: hsl(90, 100%, 35%);--white: hsl(0, 0%, 100%);--white-opacity-50: hsla(0, 0%, 100%, 50%);--white-opacity-10: hsla(0, 0%, 100%, 10%);--white-opacity-0: hsla(0, 0%, 100%, 0%);--black: hsl(0, 0%, 0%);--black-opacity-10: hsla(0, 0%, 0%, 
10%);--black-opacity-50: hsla(0, 0%, 0%, 50%);--orangeDark: hsl(30, 90%, 40%);--orangeLight: hsl(30, 80%, 50%);--text-xs: .75rem;--text-sm: .875rem;--text-md: 1rem;--text-lg: 1.125rem;--text-xl: 1.25rem;--transition-duration: .15s;--transition-timing: cubic-bezier(.4, 0, .2, 1);--transition-all: all var(--transition-duration) var(--transition-timing);--transition-colors: color var(--transition-duration) var(--transition-timing), background-color var(--transition-duration) var(--transition-timing), border-color var(--transition-duration) var(--transition-timing), text-decoration-color var(--transition-duration) var(--transition-timing), fill var(--transition-duration) var(--transition-timing), stroke var(--transition-duration) var(--transition-timing);--transition-opacity: opacity var(--transition-duration) var(--transition-timing)}@media screen and (max-width: 768px){:root{--content-width: 100%;--content-gutter: 20px}}option{background-color:var(--sidebarBackground)}:root{--background: var(--white);--contrast: var(--black);--textBody: var(--gray800);--textHeaders: var(--gray900);--textDetailAccent: var(--mainLight);--textDetailBackground: var(--coldGrayFaint);--iconAction: var(--coldGray);--iconActionHover: var(--gray800);--blockquoteBackground: var(--coldGrayFaint);--blockquoteBorder: var(--coldGrayLight);--tableHeadBorder: var(--gray100);--tableBodyBorder: var(--gray50);--warningBackground: hsl( 33, 100%, 97%);--warningHeadingBackground: hsl( 33, 87%, 64%);--warningHeading: var(--black);--errorBackground: hsl( 7, 81%, 96%);--errorHeadingBackground: hsl( 6, 80%, 60%);--errorHeading: var(--white);--infoBackground: hsl(206, 91%, 96%);--infoHeadingBackground: hsl(213, 92%, 62%);--infoHeading: var(--white);--neutralBackground: hsl(212, 29%, 92%);--neutralHeadingBackground: hsl(220, 43%, 11%);--neutralHeading: var(--white);--tipBackground: hsl(142, 31%, 93%);--tipHeadingBackground: hsl(134, 39%, 36%);--tipHeading: var(--white);--fnSpecAttr: 
var(--coldGray);--fnDeprecated: var(--yellowLight);--blink: var(--yellowLight);--codeBackground: var(--gray25);--codeBorder: var(--gray100);--codeScroll" <> ...
}
],
scripts: [],
images: [
%BUPE.Item{
duration: nil,
fallback: nil,
href: "assets/kv-observer.png",
id: "kv-observer-png",
media_overlay: nil,
media_type: "image/png",
description: nil,
properties: nil,
content: <<137, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13, 73, 72, 68, 82,
...>>
},
],
cover: true,
audio: nil,
fonts: nil,
toc: nil
}
As you can see, one of the keys is nav, which holds the navigation elements.
Is this what you're expecting?
What you parse as nav is not enough. From reading the code, I think that is just the "spine" of the package document, whose entries only reference manifest items by idref.
Take Moby Dick, for example.
When parsed with BUPE, you get a nav section like this:
[
%{idref: "coverpage-wrapper"},
%{idref: "pg-header"},
%{idref: "item5"},
%{idref: "item6"},
%{idref: "item7"},
%{idref: "item8"},
%{idref: "item9"},
%{idref: "item10"},
%{idref: "item11"},
%{idref: "item12"},
%{idref: "item13"},
%{idref: "pg-footer"}
]
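To make the limitation concrete, here is a minimal sketch of what a consumer can do with the spine alone: look up each idref among the parsed items to get the files in reading order. SpineSketch is a hypothetical helper, not part of BUPE, and plain maps with id and href keys stand in for the %BUPE.Item{} structs so the example runs on its own.

```elixir
defmodule SpineSketch do
  # Hypothetical helper, not part of BUPE: looks up each spine idref among
  # the parsed items and returns the matching hrefs in reading order.
  # Entries with no matching item are dropped.
  def hrefs(nav, pages) do
    by_id = Map.new(pages, fn item -> {item.id, item.href} end)

    nav
    |> Enum.map(fn %{idref: idref} -> Map.get(by_id, idref) end)
    |> Enum.reject(&is_nil/1)
  end
end

# Plain maps stand in for %BUPE.Item{} structs (an assumption for this sketch).
pages = [
  %{id: "nav", href: "nav.xhtml"},
  %{id: "debugging", href: "debugging.xhtml"}
]

nav = [%{idref: "cover"}, %{idref: "nav"}, %{idref: "debugging"}]

IO.inspect(SpineSketch.hrefs(nav, pages))
# prints ["nav.xhtml", "debugging.xhtml"]
```

That gives files, but no labels, no nesting, and no anchors inside a file.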
When parsed with my gist, it looks like this:
%{
toc: [
%{
label: ~c"MOBY-DICK; or, THE WHALE.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00000",
children: []
},
%{
label: [79, 114, 105, 103, 105, 110, 97, 108, 32, 84, 114, 97, 110, 115, 99, 114, 105, 98,
101, 114, 8217, 115, 32, 78, 111, 116, 101, 115, 58],
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00001",
children: []
},
%{
label: ~c"ETYMOLOGY.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00002",
children: [
%{
label: ~c"(Supplied by a Late Consumptive Usher to a Grammar School.)",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00003",
children: []
}
]
},
%{
label: ~c"EXTRACTS. (Supplied by a Sub-Sub-Librarian).",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00004",
children: [
%{
label: ~c"EXTRACTS.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00005",
children: []
}
]
},
%{
label: ~c"CHAPTER 1. Loomings.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00006",
children: []
},
%{
label: ~c"CHAPTER 2. The Carpet-Bag.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00007",
children: []
},
%{
label: ~c"CHAPTER 3. The Spouter-Inn.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00008",
children: []
},
%{
label: ~c"CHAPTER 4. The Counterpane.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00009",
children: []
},
%{
label: ~c"CHAPTER 5. Breakfast.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00010",
children: []
},
%{
label: ~c"CHAPTER 6. The Street.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00011",
children: []
},
%{
label: ~c"CHAPTER 7. The Chapel.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00012",
children: []
},
%{
label: ~c"CHAPTER 8. The Pulpit.",
href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00013",
children: []
},
%{
label: ~c"CHAPTER 9. The Sermon.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00014",
children: []
},
%{
label: ~c"CHAPTER 10. A Bosom Friend.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00015",
children: []
},
%{
label: ~c"CHAPTER 11. Nightgown.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00016",
children: []
},
%{
label: ~c"CHAPTER 12. Biographical.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00017",
children: []
},
%{
label: ~c"CHAPTER 13. Wheelbarrow.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00018",
children: []
},
%{
label: ~c"CHAPTER 14. Nantucket.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00019",
children: []
},
%{
label: ~c"CHAPTER 15. Chowder.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00020",
children: []
},
%{
label: ~c"CHAPTER 16. The Ship.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00021",
children: []
},
%{
label: ~c"CHAPTER 17. The Ramadan.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00022",
children: []
},
%{
label: ~c"CHAPTER 18. His Mark.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00023",
children: []
},
%{
label: ~c"CHAPTER 19. The Prophet.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00024",
children: []
},
%{
label: ~c"CHAPTER 20. All Astir.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00025",
children: []
},
%{
label: ~c"CHAPTER 21. Going Aboard.",
href: ~c"8921354174505514122_2701-h-1.htm.xhtml#pgepubid00026",
children: []
},
%{
label: ~c"CHAPTER 22. Merry Christmas.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00027",
children: []
},
%{
label: ~c"CHAPTER 23. The Lee Shore.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00028",
children: []
},
%{
label: ~c"CHAPTER 24. The Advocate.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00029",
children: []
},
%{
label: ~c"CHAPTER 25. Postscript.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00030",
children: []
},
%{
label: ~c"CHAPTER 26. Knights and Squires.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00031",
children: []
},
%{
label: ~c"CHAPTER 27. Knights and Squires.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00032",
children: []
},
%{
label: ~c"CHAPTER 28. Ahab.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00033",
children: []
},
%{
label: ~c"CHAPTER 29. Enter Ahab; to Him, Stubb.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00034",
children: []
},
%{
label: ~c"CHAPTER 30. The Pipe.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00035",
children: []
},
%{
label: ~c"CHAPTER 31. Queen Mab.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00036",
children: []
},
%{
label: ~c"CHAPTER 32. Cetology.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00037",
children: []
},
%{
label: ~c"CHAPTER 33. The Specksnyder.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00038",
children: []
},
%{
label: ~c"CHAPTER 34. The Cabin-Table.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00039",
children: []
},
%{
label: ~c"CHAPTER 35. The Mast-Head.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00040",
children: []
},
%{
label: ~c"CHAPTER 36. The Quarter-Deck.",
href: ~c"8921354174505514122_2701-h-2.htm.xhtml#pgepubid00041",
children: []
},
%{
label: ~c"CHAPTER 37. Sunset.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00042",
children: []
},
%{
label: ~c"CHAPTER 38. Dusk.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00043",
children: []
},
%{
label: ~c"CHAPTER 39. First Night-Watch.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00044",
children: []
},
%{
label: ~c"CHAPTER 40. Midnight, Forecastle.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00045",
children: []
},
%{
label: ~c"CHAPTER 41. Moby Dick.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00046",
children: []
},
%{
label: ~c"CHAPTER 42. The Whiteness of the Whale.",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00047",
children: []
},
%{
label: ~c"CHAPTER 43. Hark!",
href: ~c"8921354174505514122_2701-h-3.htm.xhtml#pgepubid00048",
...
},
%{label: ~c"CHAPTER 44. The Chart.", ...},
%{...},
...
]
}
You can see that multiple chapters can map to the same manifest item. This results in an href like some-page.xhtml#some-chapter-id.
The table of contents also allows nesting.
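As a rough illustration of working with that shape, the sketch below flattens a nested toc and splits each href into file and fragment. TocSketch and its output keys (depth, file, fragment) are hypothetical names chosen for this example. It assumes entries shaped like the output above, with label and href as charlists, and converts them to strings with List.to_string/1, which also handles the label that was printed as a list of integers because it contains a non-ASCII apostrophe.

```elixir
defmodule TocSketch do
  # Hypothetical helper: walks a nested toc depth-first and returns one flat
  # entry per label, with the href split into file path and fragment.
  def flatten(entries, depth \\ 0) do
    Enum.flat_map(entries, fn %{label: label, href: href, children: children} ->
      uri = URI.parse(List.to_string(href))

      entry = %{
        depth: depth,
        label: List.to_string(label),
        file: uri.path,
        fragment: uri.fragment
      }

      [entry | flatten(children, depth + 1)]
    end)
  end
end

# Two entries copied from the output above; the rest are left out.
toc = [
  %{
    label: [79, 114, 105, 103, 105, 110, 97, 108, 32, 84, 114, 97, 110, 115, 99, 114, 105, 98,
            101, 114, 8217, 115, 32, 78, 111, 116, 101, 115, 58],
    href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00001",
    children: []
  },
  %{
    label: ~c"ETYMOLOGY.",
    href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00002",
    children: [
      %{
        label: ~c"(Supplied by a Late Consumptive Usher to a Grammar School.)",
        href: ~c"8921354174505514122_2701-h-0.htm.xhtml#pgepubid00003",
        children: []
      }
    ]
  }
]

toc |> TocSketch.flatten() |> Enum.each(&IO.inspect/1)
```

Keeping the depth on each flat entry means the nesting can still be rebuilt or rendered as indentation later.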
I found this guide useful as a high-level overview of the contents of an EPUB file: https://help.apple.com/itc/booksassetguide/#/itc0f175a5b9.
@aymanosman I see now. I will try to tackle the issue you mentioned next week, or please feel free to send a PR and I will review it.
@aymanosman After PR #93 (contributed by @neodejack) I think we cover the basics, and I believe we should let consumers of this library parse the contents as they please, the same as we currently do with the scripts, CSS, images, and other XHTML files.
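For anyone who wants to do that parsing, a first step might look like the sketch below: pick out the item whose properties include nav, as in the Elixir.epub output earlier in this thread, and hand its content to an XML parser of your choice. NavItemSketch is a hypothetical helper, not part of BUPE, and plain maps again stand in for %BUPE.Item{} structs. This only covers EPUB 3 navigation documents; EPUB 2 books carry their table of contents in an NCX file instead, which this sketch does not handle.

```elixir
defmodule NavItemSketch do
  # Hypothetical helper, not part of BUPE: returns the first item whose
  # space-separated properties include "nav", or nil if there is none.
  def find(pages) do
    Enum.find(pages, fn item ->
      "nav" in String.split(item.properties || "")
    end)
  end
end

# Plain maps stand in for %BUPE.Item{} structs (an assumption for this sketch).
pages = [
  %{id: "debugging", href: "debugging.xhtml", properties: nil},
  %{id: "nav", href: "nav.xhtml", properties: "nav scripted"}
]

IO.inspect(NavItemSketch.find(pages).href)
# prints "nav.xhtml"
```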
I will close this issue as resolved, but I'm open to continuing the discussion and even reopening it.
Thanks, and I'm glad there is progress. Anybody interested in extracting the toc content can reference my gist.