Wrapped PDF documents with index
I have a lot of documentation in PDF format. Many documents are hundreds or even thousands of pages long, and some are split into multiple volumes. For example, the ARM v7 Architecture Reference Manual is over 700 pages, and the combined x86 SDM is nearly 5000 pages long.
These PDFs are a pain to search, as you don't have the super-easy searching abilities and UI (esp. forward/back and tabs) that Zeal has. And if you are working from these documents, you're probably neck-deep in dozens of register names or instruction codes, where quick reference would be amazingly useful.
It would be really useful if, as well as HTML pages, you could have an index pointing into an existing PDF (where the index is maintained in a similar way to how a docset's index is now).
There are some people who take it upon themselves to decompile these PDFs into HTML pages, with various degrees of success and accuracy (e.g. https://github.com/zneak/x86doc). But these conversions are rare and, unless you are very lucky, unlikely to be exactly what you want, and doing it fresh for each PDF is a long and miserable task!
This is a kind of woolly and long-term thing, but it would be incredibly useful to me. I see mention made of CHM documents, but I thought I'd throw indexing of PDFs out as well!
Viewing PDFs should be achievable with PDF.js from Mozilla, but I am not sure how much value we can provide here. All PDF viewers have a lot more features than Zeal. Also, PDF files do not have a keyword index, so searching would only be possible through either the ToC or the full contents.
I am not opposed to the idea, but I don't have a clear picture yet, and this will be pretty low priority.
Sure, I understand it's pretty low priority. The problem I have with existing (normal) PDF viewers is that you need to find and open the right PDF (or alt-tab to it), and their section navigation is pretty clunky for Zeal-like purposes. Evince actually has a relatively good way (you can type section titles into the page number box), but even then it's not as clear as in Zeal. There is often no "history" either. Additionally, it depends on the quality of the PDF's own index. To be fair, in large documentation manuals, this is normally at least "OK".
WRT the index, I was thinking more of a side-car index, kind of like a docset dsidx file, which provides pointers into a PDF (page number and perhaps the element). At a basic level, this could be implicitly derived from the ToC in the PDF (so any PDF with a good ToC works). However, it could also be constructed in a much more detailed way by an interested party, much as dsidx files are made by docset compilers.
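To make the side-car idea a bit more concrete, here is a rough sketch, not a proposal for the actual format. It assumes the index reuses the `searchIndex` table layout that Dash-style dsidx files use (as I understand it), and that a viewer can jump to a page via a `#page=N` fragment. The file name `manual.pdf`, the entry names, types and page numbers are all made-up placeholders standing in for whatever a ToC extractor would produce.

```python
import sqlite3

# Placeholder entries, as if read from a PDF's ToC:
# (name, entry type, PDF file, 1-based page number). All values are invented.
TOC_ENTRIES = [
    ("MOV", "Instruction", "manual.pdf", 412),
    ("MOVT", "Instruction", "manual.pdf", 415),
    ("SCTLR", "Register", "manual.pdf", 633),
]


def build_sidecar_index(conn, entries):
    """Create a dsidx-like table and fill it with pointers into the PDF."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searchIndex("
        "id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT)"
    )
    conn.executemany(
        "INSERT INTO searchIndex(name, type, path) VALUES (?, ?, ?)",
        # Assumption: the target is encoded as "<file>#page=<N>".
        [(name, kind, f"{pdf}#page={page}") for name, kind, pdf, page in entries],
    )
    conn.commit()


def lookup(conn, prefix):
    """Return (name, path) pairs whose name starts with the given prefix."""
    rows = conn.execute(
        "SELECT name, path FROM searchIndex WHERE name LIKE ? ORDER BY name",
        (prefix + "%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_sidecar_index(conn, TOC_ENTRIES)
print(lookup(conn, "MOV"))
# [('MOV', 'manual.pdf#page=412'), ('MOVT', 'manual.pdf#page=415')]
```

The PDF itself is never touched here; only the small SQLite file would need to be created or shared. A real version would have to escape `%` and `_` in the search prefix, since `LIKE` treats them as wildcards.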
I'm not really asking for any immediate action, more that in a future format, indexing into "hardcoded" files (i.e. PDF, but also other formats, perhaps) is at least possible, as it would be incredibly useful for electronics, where HTML documentation is rare, PDF is very common, and documents run to 1000s of pages.
I think generating and hosting indexes for many PDFs would be impossible due to copyright concerns.
Looks like PDF supports indices. So if provided, it should be possible to display.
Are there any good public registries of technical or scientific PDFs?
One example of FOSS documentation often provided as PDF is LaTeX package documents (https://ctan.org/pkg), which actually don't seem to always have ToCs (and are split up into a PDF per package).
If a mechanism for collating and indexing sets of 3rd-party PDFs existed, lots of sources would be opened up. There are lots of public domain documents (e.g. from the US govt, NASA, etc.) that are not available as HTML. One large source is: https://archive.org/details/additional_collections. An indexed collection of US or UN legal documents, for example, would be a relatively novel and (I imagine) useful thing, though clearly not the focus of Zeal. Also, perhaps https://arxiv.org would be usable to make indexed topic-oriented collections of scientific papers.
However, the killer app for me is being able to collate and index electronics manuals and datasheets. Even if these indexes couldn't be published by Zeal, they'd be incredibly useful.
I'm not really asking for any specific index per se, I'm asking for the capability for "pointing into" bundles of PDFs like docsets "point into" HTML bundles.
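As a sketch of what "pointing into" a bundle could mean, the snippet below uses one manifest to cover several PDFs and resolves an entry name to a file-plus-page target. Everything here is hypothetical: the JSON layout, the `example-soc.pdfset` bundle name, the document paths and the entries are invented for illustration, and the `#page=N` fragment is an assumption about what the viewer accepts.

```python
import json
from pathlib import PurePosixPath

# Hypothetical manifest for a bundle of PDFs; none of these files or entries are real.
MANIFEST = json.loads("""
{
  "name": "example-soc",
  "documents": {
    "trm": "docs/reference_manual_vol1.pdf",
    "ds": "docs/datasheet.pdf"
  },
  "entries": [
    {"name": "GPIO_CTRL", "type": "Register", "doc": "trm", "page": 120},
    {"name": "Absolute maximum ratings", "type": "Section", "doc": "ds", "page": 14}
  ]
}
""")


def resolve(manifest, bundle_root, name):
    """Return 'path#page=N' targets for entries matching name (case-insensitive)."""
    targets = []
    for entry in manifest["entries"]:
        if entry["name"].lower() == name.lower():
            # Each entry names a document by key, so one index can span many PDFs.
            pdf = PurePosixPath(bundle_root) / manifest["documents"][entry["doc"]]
            targets.append(f"{pdf}#page={entry['page']}")
    return targets


print(resolve(MANIFEST, "example-soc.pdfset", "gpio_ctrl"))
# ['example-soc.pdfset/docs/reference_manual_vol1.pdf#page=120']
```

Because the manifest only refers to the PDFs by relative path, it could in principle be distributed on its own, with users supplying the PDFs themselves.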
RE copyright, it would certainly not be allowed to distribute original or modified (non-free) PDFs with extra TOCs or indexes embedded in them. Indexes (and lists, like phone books) themselves do attract copyright, but that copyright is separate to the main work (though often assigned to the author or publisher). Therefore, you can't take someone else's index and publish it. However, as I understand it, it is possible to create and distribute an index of your own separately to the original work. At least, this is my non-lawyerly understanding, and, like all copyright things, probably depends on location and exactly what is in the index.
EDIT: another useful "pile o' PDFs" appropriate for open source software are the C++ standardisation papers: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/
Linking to #1261 for enablement of the Qt WebEngine's built-in PDF viewing functionality.
The functionality described in this ticket is about deeper integration with the PDF format, so I'll keep this open.