[Feature Request]: Split PDF by chapter

Open pepijnolivier opened this issue 1 year ago • 2 comments

Feature Description

  • Detect chapters by finding and interpreting a table of contents.
  • Split the source PDF into multiple PDFs, one per chapter. The table of contents should also get its own PDF.
  • Optionally label the output files as well, e.g. 0. Table of contents, 1. Introduction.
  • Ideally, the chapter tree would be configurable:
    • levels: 1 would split only at top-level chapters.
    • levels: 2 would split at subchapters as well, e.g. 1.1. Introduction - Installation, 1.2. Introduction - Getting started.
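
To make the levels option concrete, here is a minimal Python sketch of how split ranges and file labels might be derived from a chapter list. Everything in it is an assumption for illustration: the function name chapter_ranges, the (number, title, first page) tuple layout, 1-based page numbers, and the rule that a chapter ends on the page before the next kept entry starts. It is not existing Stirling-PDF code.

```python
from typing import List, Tuple

# Hypothetical input shape: (chapter number, title, first page), pages are 1-based.
Entry = Tuple[str, str, int]


def chapter_ranges(entries: List[Entry], total_pages: int, levels: int = 1):
    """Return (label, first_page, last_page) for every entry at depth <= levels."""
    # Depth is inferred from the dots in the chapter number: "1" -> 1, "1.2" -> 2.
    kept = [e for e in entries if e[0].count(".") + 1 <= levels]
    kept.sort(key=lambda e: e[2])

    ranges = []
    for i, (number, title, start) in enumerate(kept):
        # A chapter runs until the page before the next kept entry, or to the end.
        end = kept[i + 1][2] - 1 if i + 1 < len(kept) else total_pages
        # Two entries starting on the same page would give end < start; clamp it.
        ranges.append((f"{number}. {title}", start, max(start, end)))
    return ranges


entries = [
    ("0", "Table of contents", 1),
    ("1", "Introduction", 3),
    ("1.1", "Installation", 4),
    ("1.2", "Getting started", 7),
    ("2", "Usage", 12),
]

for label, first, last in chapter_ranges(entries, total_pages=20, levels=1):
    print(f"{label}: pages {first}-{last}")
```

One consequence of this particular rule: with levels=2, a top-level chapter only keeps the pages before its first subchapter (here, "1. Introduction" would cover page 3 only). Whether that is the desired behavior is an open design question.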

Why is this feature valuable?

This could be useful for many purposes:

  • Splitting a huge document into chapters could help teachers provide subsets of the material to their students.
  • Chapter files may be easier to search and scan when browsing a folder.
  • Document indexing and search, such as Elasticsearch or Azure Cognitive Search:
    • Best-match searches are far more meaningful when a huge document is split into chapters. This is also better than splitting a document into pages, because each chapter keeps its own context.

Suggested Implementation

  • Either be really fancy and auto-detect the table of contents,
  • or let the user indicate that there is a table of contents and specify its page numbers.
  • Interpret each line of the table of contents: usually the title is on the left and the page number on the right.
  • Build a map of the table of contents and let the user confirm it is correct before continuing.
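
The line-interpretation step could look roughly like the Python sketch below. It assumes the text of the table of contents has already been extracted, one entry per line, in the "title on the left, page number on the right" layout, optionally with a leading chapter number and dot leaders. The names TocEntry, TOC_LINE and parse_toc are hypothetical, and real documents will need more tolerant patterns.

```python
import re
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    number: str  # e.g. "1.2", or "" when the line has no chapter number
    title: str
    page: int


# Assumed layout: [chapter number] title [dot leaders / spaces] page number
TOC_LINE = re.compile(
    r"^\s*(?P<number>\d+(?:\.\d+)*)?\.?\s*"  # optional "1", "1.2", "1.2."
    r"(?P<title>.+?)"                        # title, as short as possible
    r"[\s.]*"                                # dot leaders and padding
    r"(?P<page>\d+)\s*$"                     # page number at the end of the line
)


def parse_toc(text: str) -> List[TocEntry]:
    """Parse each line that looks like a TOC entry; skip everything else."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            entries.append(
                TocEntry(
                    number=match.group("number") or "",
                    title=match.group("title").strip(),
                    page=int(match.group("page")),
                )
            )
    return entries


toc_text = """Contents
1. Introduction ........ 3
1.1 Installation ....... 4
1.2 Getting started .... 7
2. Usage ............... 12
"""

# Show the parsed map so the user can confirm it before any splitting happens.
for entry in parse_toc(toc_text):
    print(f"{entry.number or '-':<4} {entry.title:<20} page {entry.page}")
```

The heading line "Contents" is skipped because it does not end in a page number. Printing the parsed map first matches the "let the user confirm" step above.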

Additional Information

This should be tested on huge, official documents.

No Duplicate of the Feature

  • [X] I have verified that there are no existing feature requests similar to my request.

pepijnolivier, Jul 24 '24 06:07

I think this would work better as two separate steps: first extract the page numbers from the TOC, then split the document using "Split PDF". For extracting the page numbers, maybe we could have a feature that runs a regex on the text of the given page(s) and outputs the matches. It could also include some common expressions to make this easier.
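
A rough Python sketch of that first step, under stated assumptions: the page text is already available as a string, the regex has exactly one capture group holding the page number, and the split step accepts a comma-separated page list (the exact input format of "Split PDF" is not confirmed here). COMMON_PATTERNS and extract_page_numbers are made-up names.

```python
import re
from typing import List

# Hypothetical presets; each has one capture group for the page number.
COMMON_PATTERNS = {
    # "Introduction ..... 3"
    "dot_leaders": r"\.{2,}[ \t]*(\d+)[ \t]*$",
    # "Introduction 3"
    "trailing_number": r"[ \t](\d+)[ \t]*$",
}


def extract_page_numbers(page_text: str, pattern: str) -> List[int]:
    """Run the regex on every line and return the sorted, de-duplicated matches."""
    regex = re.compile(pattern, re.MULTILINE)  # "$" matches at each line end
    pages = {int(m.group(1)) for m in regex.finditer(page_text)}
    return sorted(pages)


toc_page = """Contents
Introduction ..... 3
Usage ..... 12
Index ..... 12
"""

pages = extract_page_numbers(toc_page, COMMON_PATTERNS["dot_leaders"])
# Joined form that could be pasted into a split-by-page-numbers field.
print(",".join(str(p) for p in pages))
```

Keeping extraction separate from splitting means the user can review or edit the page list before anything is written.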

sbplat, Aug 31 '24 01:08

For PDFs with predefined outlines, check this draft: https://github.com/Stirling-Tools/Stirling-PDF/pull/1786

Rudra-241, Aug 31 '24 22:08