Stirling-PDF
[Feature Request]: Split PDF by chapter
Feature Description
- Detect chapters by finding and interpreting a table of contents.
- Split the source PDF into multiple PDFs: one per chapter. The table of contents should also have its own PDF.
- Optionally label the output files as well, e.g. `0. Table of contents`, `1. Introduction`.
- Ideally we have some configurable chapter tree options (`levels: 1` would only split top-level chapters, `levels: 2` would split subchapters as well, e.g. `1.1. Introduction - Installation`, `1.2 Introduction - Getting started`).
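A rough sketch of how the `levels` option and the output labels could fit together. Python is used here for illustration only; the `Chapter` type, the `output_labels` function and the dotted chapter numbers as input are all assumptions, not existing Stirling-PDF code.

```python
from dataclasses import dataclass


@dataclass
class Chapter:
    number: str  # dotted chapter number, e.g. "1" or "1.2" (assumed input format)
    title: str


def depth(number: str) -> int:
    """Nesting depth of a dotted chapter number: "1" -> 1, "1.2" -> 2."""
    return len([part for part in number.split(".") if part])


def output_labels(chapters: list[Chapter], levels: int) -> list[str]:
    """Build one label per chapter that is at most `levels` deep.

    A subchapter label is prefixed with the titles of its ancestors,
    e.g. "1.1. Introduction - Installation".
    """
    by_number = {c.number: c for c in chapters}
    labels = []
    for chapter in chapters:
        if depth(chapter.number) > levels:
            continue  # deeper than requested: stays inside its parent's file
        parts = chapter.number.split(".")
        # Collect titles along the path "1" -> "1.2" -> "1.2.3".
        path = [".".join(parts[: i + 1]) for i in range(len(parts))]
        titles = [by_number[n].title for n in path if n in by_number]
        labels.append(f"{chapter.number}. {' - '.join(titles)}")
    return labels


chapters = [
    Chapter("0", "Table of contents"),
    Chapter("1", "Introduction"),
    Chapter("1.1", "Installation"),
    Chapter("1.2", "Getting started"),
]
print(output_labels(chapters, levels=1))
print(output_labels(chapters, levels=2))
```

With `levels=1` only the top-level entries get their own file; with `levels=2` the subchapters are labelled with their parent's title as a prefix.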
Why is this feature valuable?
This could be useful for many purposes:
- Splitting a huge document into chapters could help teachers provide subsets of the material to their students
- The output might be easier to search or scan when browsing a folder
- Document indexing and search, such as Elasticsearch or Azure Cognitive Search
- Best-match searches are far more meaningful when a huge document is split up into chapters. This is also better than splitting a document into pages, because within a chapter we keep the context of that chapter.
Suggested Implementation
- Either be really fancy and auto-detect a table of contents
- Or allow the user to specify that there is a table of contents and provide its page numbers
- Interpret each line of the table of contents: usually the title is on the left and the page number on the right.
- Create a map of the table of contents, let the user confirm it is correct before continuing
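The line-interpretation step could look roughly like this. It is a minimal sketch in Python that assumes the text of the table of contents has already been extracted, and that each entry is a single line ending in its page number, optionally after dot leaders. `TocEntry` and `parse_toc` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Title on the left, then spaces and/or dot leaders, then the page number at the end.
TOC_LINE = re.compile(r"^\s*(?P<title>.+?)[ .\t]+(?P<page>\d+)\s*$")


@dataclass
class TocEntry:
    title: str
    page: int


def parse_toc(text: str) -> list[TocEntry]:
    """Return one entry per line that looks like "<title> ..... <page>".

    Lines without a trailing page number (such as a "Contents" heading)
    are skipped rather than guessed at.
    """
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append(TocEntry(match["title"], int(match["page"])))
    return entries


sample = """Contents
1. Introduction ........ 3
1.1 Installation . . . . 5
Appendix 120
"""
for entry in parse_toc(sample):
    # This is the map that would be shown to the user for confirmation.
    print(f"{entry.title!r} starts on page {entry.page}")
```

Entries that wrap over two lines, or Roman-numeral page numbers, would not be picked up by this pattern, which is one more reason to let the user confirm the resulting map.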
Additional Information
To be tested on huge and official documents
No Duplicate of the Feature
- [X] I have verified that there are no existing feature requests similar to my request.
I think it would be more suitable for this to be 2 separate steps. First, extract the page numbers from the TOC, and then split the document using "Split PDF". For extracting the page numbers, maybe we could have a feature that runs a regex on the text of the given page(s) and outputs the matches. It could include some common expressions as well to make it easier.
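To illustrate the two-step idea, here is a small Python sketch: a regex pulls the trailing page numbers out of the TOC text, and a second function turns them into page ranges that a split step could consume. The function names and the default pattern are assumptions, and so is the idea that printed page numbers match the PDF's page indices (many documents would need an offset).

```python
import re


def extract_page_numbers(toc_text: str, pattern: str = r"(\d+)[ \t]*$") -> list[int]:
    """Step 1: run a regex over the TOC text and collect the matched page numbers."""
    pages = [int(m) for m in re.findall(pattern, toc_text, flags=re.MULTILINE)]
    return sorted(set(pages))


def page_ranges(starts: list[int], total_pages: int) -> list[tuple[int, int]]:
    """Step 2: turn chapter start pages into inclusive (first, last) page ranges.

    Pages before the first chapter (typically the TOC itself) get their own range.
    """
    ranges = []
    if starts and starts[0] > 1:
        ranges.append((1, starts[0] - 1))
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges


toc_text = "Introduction .... 3\nInstallation .... 5\nUsage .... 9\n"
starts = extract_page_numbers(toc_text)
print(starts)
print(page_ranges(starts, total_pages=12))
```

The default pattern matches any line ending in digits, so a line such as a copyright year would also be picked up. That is where a set of common, user-selectable expressions would help.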
For PDFs with predefined outlines, check this draft: https://github.com/Stirling-Tools/Stirling-PDF/pull/1786