Chapter/Content index preservation

Open alexdelorenzo opened this issue 7 months ago • 1 comment

Cool project, appreciate the work put into it :)

What is the feature you think should be a good addition to Dangerzone?

It would be convenient if processed PDFs, EPUBs, etc. retained their chapter/content indexes for easy navigation.

Is your feature request related to a problem? Please describe.

Big PDFs often have either tables of contents or content indexes embedded in them that let you jump to the pages containing the indexed content.

Dangerzone-processed PDFs don't retain those indexes.

Additional context

None, just not sure if that is data that can be retained through the conversion process. If it can be, this would be a nice feature.

May 26 '25 05:05 alexdelorenzo

Hi, and thanks for opening a ticket here.

I can see why that could be useful, for sure, especially on large documents.

Because of the way the conversion works, I'm not sure this is possible. When reconstructing the PDF, we only get a stream of pixels, which we put into a new PDF before running optical character recognition (OCR) on the result. We also don't want to carry over the existing content index, because it could be a target for attackers. We're completely reconstructing the document and, as a result, removing all the metadata it contained.

One way to do it would be for the OCR engine to detect titles and their levels. For OCR, we're using Tesseract. I checked, and nothing related to chapters or content indexes is currently in their tracker.

I had a quick look online for tools that could help us do this and unfortunately came up short. One thing that comes to mind would be to try to detect the titles on each page, if any, and then reconstruct an index from them ourselves.
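To make the title-detection idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under assumptions, not something Dangerzone does today: `OcrLine` and `build_outline` are hypothetical names, and the heuristic (a line is a heading if it is noticeably taller than the median line) is a guess that would need testing on real documents. It assumes the OCR step can give us a height for each recognized line, which Tesseract's TSV output does report for each box.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class OcrLine:
    """One recognized line of text (hypothetical structure, not a Tesseract type)."""
    page: int      # 1-based page number
    text: str      # recognized text of the line
    height: float  # height of the line's bounding box, in pixels


def build_outline(lines, ratio=1.3, max_levels=3):
    """Return (level, title, page) tuples for lines that look like headings.

    Assumed heuristic: a line is a heading when its height is at least
    `ratio` times the median line height. Taller headings get a lower
    (more important) level number.
    """
    if not lines:
        return []
    body_height = median(line.height for line in lines)
    headings = [line for line in lines if line.height >= ratio * body_height]
    # Distinct heading heights, tallest first, capped at max_levels.
    sizes = sorted({round(line.height) for line in headings}, reverse=True)
    sizes = sizes[:max_levels]
    outline = []
    for line in headings:
        size = round(line.height)
        # Headings smaller than the kept sizes fall into the deepest level.
        level = sizes.index(size) + 1 if size in sizes else max_levels
        outline.append((level, line.text.strip(), line.page))
    return outline


if __name__ == "__main__":
    sample = [
        OcrLine(1, "Introduction", 40),
        OcrLine(1, "Some body text", 20),
        OcrLine(2, "Background", 30),
        OcrLine(2, "More body text", 20),
        OcrLine(2, "Even more body text", 21),
    ]
    for level, title, page in build_outline(sample):
        print("  " * (level - 1) + f"{title} (page {page})")
```

Since the index would be rebuilt purely from what the OCR sees in the pixels, nothing from the original document's metadata is reused. A list of `(level, title, page)` entries is also the shape that some PDF libraries accept for writing an outline (PyMuPDF's `Document.set_toc`, for example), though whether that fits our pipeline is an open question.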

Let's keep this open and discuss it with the team, in case anybody has a good idea about it, before saying if we can / can't do it.

May 26 '25 08:05 almet