Improvements to Plugins
Here I added some plugin improvements.
- Dependencies
- Hooks can have an after list, informing that this hook needs to run after these other hooks
- @page is pulled and parsed from the HTML to create general document information that is also passed as a parameter to hooks.
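To illustrate the after list, here is a minimal sketch of how hooks could be ordered so that each one runs after the hooks it names. The hook shape ({ name, after }) and the orderHooks function are assumptions made for this example, not the actual plugin API.

```javascript
// Hypothetical hook shape: { name: string, after?: string[] }.
// Returns the hooks sorted so that every hook comes after the ones in its "after" list.
function orderHooks(hooks) {
  const byName = new Map(hooks.map((hook) => [hook.name, hook]));
  const state = new Map(); // name -> "visiting" | "done"
  const ordered = [];

  function visit(hook) {
    const current = state.get(hook.name);
    if (current === "done") return;
    if (current === "visiting") {
      throw new Error(`Circular "after" chain involving ${hook.name}`);
    }
    state.set(hook.name, "visiting");
    for (const depName of hook.after || []) {
      const dep = byName.get(depName);
      if (!dep) {
        throw new Error(`${hook.name} must run after unknown hook ${depName}`);
      }
      visit(dep); // schedule the prerequisite first
    }
    state.set(hook.name, "done");
    ordered.push(hook);
  }

  hooks.forEach(visit);
  return ordered;
}

const order = orderHooks([
  { name: "tableOfContents", after: ["pageNumber"] },
  { name: "pageNumber" },
]).map((hook) => hook.name);
console.log(order); // pageNumber is listed before tableOfContents
```

A depth-first walk keeps the sketch short and also catches circular after lists, which would otherwise never resolve.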
When building this I created the pageNumber plugin, which needs to know the general document information, width and height of the final PDF, as well as its margins. This plugin loops through all the elements in the page and embeds a data-page field that has the page that element resides on.
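A rough sketch of the idea, assuming the document information exposes the page height and margins in CSS pixels and that each element's offsetTop is measured from the top of the document. The names documentInfo, pageForOffset and annotatePages are made up for this example, and the calculation ignores forced page breaks and elements that cannot be split.

```javascript
// Hypothetical document information, all values in CSS pixels.
const documentInfo = { height: 1123, margin: { top: 96, bottom: 96 } };

// Page (1-based) that a given vertical offset falls on, assuming content
// simply flows from one page to the next with nothing pushed around.
function pageForOffset(offsetTop, info) {
  const contentHeight = info.height - info.margin.top - info.margin.bottom;
  return Math.floor(offsetTop / contentHeight) + 1;
}

// Stores the page on each element as data-page.
// In the browser, "elements" could come from document.body.querySelectorAll("*").
function annotatePages(elements, info) {
  for (const element of elements) {
    element.dataset.page = String(pageForOffset(element.offsetTop, info));
  }
}

// Plain objects stand in for DOM elements so the sketch runs outside a browser.
const sampleElements = [
  { offsetTop: 0, dataset: {} },
  { offsetTop: 931, dataset: {} },
  { offsetTop: 2000, dataset: {} },
];
annotatePages(sampleElements, documentInfo);
console.log(sampleElements.map((element) => element.dataset.page));
```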
I also created the table of contents plugin which uses the dependencies and after parameter. The table of contents needs the pageNumber plugin to operate. I did not bundle the pageNumber functionality in with the table of contents as there will be other plugins that would use the exact same functionality in the future (index list and figure list). This way the numbering does not run multiple times.
Regarding #57: currently there is no way to do this, because puppeteer copies the header and footer template exactly onto every page. No footnotes per page, no specific header / footer per page, etc. I know I want to add the ability to have custom ranges for the page numbers, such as using roman numerals for a preface, then regular numbers for the rest of the document. In order to do this I think we need to stop puppeteer from rendering headers / footers and do them ourselves. The pageNumber plugin is a start towards doing this. I am looking into a solution to embed headers / footers at the correct spots with fixed sizes; puppeteer would then use no margins and render the document as it is. This way we could do custom per-page headers / footers.
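As an illustration of custom page number ranges, here is a small sketch that maps a physical page to its printed label. The range configuration format and the function names are assumptions for the example only; nothing like this exists in the plugin yet.

```javascript
// Converts a positive integer to lowercase roman numerals.
function toRoman(n) {
  const table = [
    [1000, "m"], [900, "cm"], [500, "d"], [400, "cd"],
    [100, "c"], [90, "xc"], [50, "l"], [40, "xl"],
    [10, "x"], [9, "ix"], [5, "v"], [4, "iv"], [1, "i"],
  ];
  let out = "";
  for (const [value, numeral] of table) {
    while (n >= value) {
      out += numeral;
      n -= value;
    }
  }
  return out;
}

// Hypothetical configuration: physical pages "from".."to" (1-based, inclusive).
const pageRanges = [
  { from: 1, to: 4, style: "roman" }, // preface
  { from: 5, to: Infinity, style: "decimal" }, // rest of the document
];

// Label printed on a given physical page; numbering restarts in each range.
function pageLabel(physicalPage, ranges) {
  const range = ranges.find((r) => physicalPage >= r.from && physicalPage <= r.to);
  if (!range) return String(physicalPage);
  const local = physicalPage - range.from + 1;
  return range.style === "roman" ? toRoman(local) : String(local);
}

console.log([3, 4, 5, 7].map((page) => pageLabel(page, pageRanges)));
```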
That's not exactly what I meant by "open an issue before a pull request", but maybe I wasn't very clear. Issues are for exposing a problem and discussing a solution. I'll be a bit long, but that's so I can expose my thought process on this.
Here you are addressing several problems at once and directly proposing code; this is more a PR than an issue report.
You start from "how to make a ToC" (this one has its own issue open on Github, #8), and then go on to tackle "how to know which page an element is on" (which would deserve its own issue) and on the way you find that the plugin API is too restrictive (again, this could be another issue and another discussion).
Regarding the ToC, your code proposes a solution, which could have been better discussed/validated beforehand. That is not mandatory, but it can avoid spending time on the wrong things. You should also describe the solution in the PR description (you generally write down how your PR changes the code, but not how the algorithm will work). From the code, I infer that your solution consists of two ideas:
First, the ToC is rendered without page numbers by looking for h1, h2, etc. in the page. This idea we discussed earlier and I like it, and I'll certainly keep it.
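For reference, the first idea can be sketched as below: a flat list of headings, in document order, is turned into a nested ToC without page numbers. The input shape ({ level, text }) and the buildToc name are assumptions for this example; in the browser the list could be built from document.querySelectorAll("h1, h2, h3"), taking the level from the tag name.

```javascript
// Builds a nested table of contents from headings given in document order.
// Each heading is { level: 1 | 2 | 3 ..., text: string } (hypothetical shape).
function buildToc(headings) {
  const root = { level: 0, children: [] };
  const stack = [root]; // chain of currently open sections
  for (const { level, text } of headings) {
    const node = { level, text, children: [] };
    // Close every open section that is at the same depth or deeper.
    while (stack[stack.length - 1].level >= level) stack.pop();
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const toc = buildToc([
  { level: 1, text: "Introduction" },
  { level: 2, text: "Scope" },
  { level: 2, text: "Goals" },
  { level: 1, text: "Methods" },
]);
console.log(JSON.stringify(toc, null, 2));
```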
Second, infer the page of the different sections by looking at the "y-offset" of their elements. I believe this solution is flawed, because the layout of the printed PDF can differ significantly from the layout of your page.
A first problematic case is when you have a picture at the end of a page. If the height of the picture is slightly more than the space left on the page, Chrome will leave a blank space at the end of the page and report the picture on the next page. Everything gets pushed, compared to the "y-offset" they had in the browser. You will have the same problem for any element with a page-break-inside: avoid property, and even worse problems for more complex layouts, like two-column elements.
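The drift can be shown with a toy model. This is a simplification written for this discussion, not Chrome's actual pagination algorithm: blocks are stacked vertically, and a block marked unbreakable that does not fit in the space left on the page is moved to the next page. The block list and the function names are invented for the example.

```javascript
// Page numbers inferred from on-screen offsets only.
function naivePages(blocks, pageHeight) {
  let y = 0;
  return blocks.map((block) => {
    const page = Math.floor(y / pageHeight) + 1;
    y += block.height;
    return page;
  });
}

// Page numbers when unbreakable blocks are pushed to the next page.
function printedPages(blocks, pageHeight) {
  let y = 0; // position in the printed flow, blank gaps included
  return blocks.map((block) => {
    const remaining = pageHeight - (y % pageHeight);
    const fitsOnOnePage = block.height <= pageHeight;
    if (block.unbreakable && fitsOnOnePage && block.height > remaining) {
      y += remaining; // blank space is left at the end of the page
    }
    const page = Math.floor(y / pageHeight) + 1;
    y += block.height;
    return page;
  });
}

const blocks = [
  { name: "intro", height: 800 },
  { name: "picture", height: 300, unbreakable: true },
  { name: "section 2", height: 750 },
  { name: "section 3", height: 100 },
];
console.log(naivePages(blocks, 1000)); // pages guessed from the y-offset
console.log(printedPages(blocks, 1000)); // pages in the toy printed layout
```

With a page height of 1000, the picture and "section 3" each land one page later in the printed model than the y-offset suggests, because the 200 units of blank space shift everything that follows.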
I also think that this solution is not elegant: it obliges you to write lots of code to parse the CSS and more or less emulate the PDF printer, and the users will have to use a special newPage mixin from your plugin to break pages. I am not using this feature, so I am not very motivated to propose solutions, but if I had to, I would try in this order:
- Insist with the Chrome team that Chromium add a "ToC functionality", the same way they already provide a footer and header.
- Try a post-PDF hook, where the PDF is parsed to retrieve the section pages and replace the right number in the TOC. I believe this one has chances to work, using possibly pdf.js, and would require no change in the current code.
- Have a look at how other software manages ToCs (it seems that wkhtml2pdf has a toc feature, not sure how great it is). Maybe propose a wkhtml2pdf backend?
- Try your solution, but expect bugs due to the issues I mention above.
- Give up and call it a problem too complicated at the moment and not worth reinventing the wheel
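To make the post-PDF hook option more concrete, here is a sketch of the lookup step only. It assumes the text of each page has already been extracted from the generated PDF by some PDF library; the extraction itself is not shown. The function name, the pageTexts input and the skipPages parameter are hypothetical.

```javascript
// headings: section titles as they appear in the ToC.
// pageTexts: one string per PDF page, in order (assumed to be extracted elsewhere).
// skipPages: number of leading pages to ignore, e.g. the ToC itself,
//            which contains every title and would otherwise always match first.
function findHeadingPages(headings, pageTexts, skipPages = 0) {
  const result = {};
  for (const heading of headings) {
    const index = pageTexts.findIndex(
      (text, i) => i >= skipPages && text.includes(heading)
    );
    result[heading] = index === -1 ? null : index + 1; // 1-based page, or null
  }
  return result;
}

const pages = findHeadingPages(
  ["Introduction", "Methods"],
  [
    "Contents Introduction Methods",
    "Introduction Lorem ipsum",
    "dolor sit amet",
    "Methods consectetur",
  ],
  1
);
console.log(pages);
```

A plain substring match like this would be fooled by a title that also appears in body text, so a real hook would need a stricter way to recognise headings.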
In summary: are you certain there is no better solution than computing page numbers using javascript (which may not work)? Can you prove that the solution will work with the "picture at page bottom" problem I mention? And if it doesn't, what solutions can we find?
I really wanted to take the time to expose my concerns with this solution and, more importantly, with the process (issues first, then discussion, and PR at the end).
The image problem is an easy fix: use element.offsetTop + element.offsetHeight, which reports the y position of the bottom of the element. Then if it gets bumped to the next page, the page number is incremented. Additionally, I had planned to combine that with a range: should an element (mostly paragraphs) extend across multiple pages, set data-page to a range: 2-3.
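A minimal sketch of the range idea, assuming contentHeight is the printable height of one page in CSS pixels and offsets are measured from the top of the document. The pageRange name is made up for the example. It still reads the on-screen layout, so it does not account for elements the printer moves to the next page.

```javascript
// Returns the data-page value for an element: "2" if it sits on a single page,
// "2-3" if it starts on page 2 and ends on page 3.
function pageRange(offsetTop, offsetHeight, contentHeight) {
  const first = Math.floor(offsetTop / contentHeight) + 1;
  // The last pixel row of the element is at offsetTop + offsetHeight - 1.
  const bottom = Math.max(offsetTop, offsetTop + offsetHeight - 1);
  const last = Math.floor(bottom / contentHeight) + 1;
  return first === last ? String(first) : `${first}-${last}`;
}

console.log(pageRange(100, 200, 1000)); // fits on one page
console.log(pageRange(1500, 1200, 1000)); // spans two pages
```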
As for the page break, I could make the pageNumber plugin scan for such effects and apply a range to it as before. The two-column thing I had not thought of before. From what I can tell, puppeteer does not have an option to do this automatically from a list; for a multi-column workflow, it must be set up in the document:
<div id="left" style="width: 300px; float: left">...</div>
<div id="right" style="width: 300px; float: left">...</div>
With this, element.offsetTop would still report the value. As I write, I realise that as the pageNumber plugin loops through all elements, it does not reset the page number, so yes, it is broken in this regard right now. I will need to find a solution to overcome this, not sure how.
It looks like there are issues open for this kind of functionality on the puppeteer GitHub. I hope it does happen; it would make this easier. pdf.js, from what I can tell, is only a browser interface for displaying PDFs; I did not see any options for manipulating them. I did look at wkhtml2pdf and it looks like a pretty good option. I have not spent much time on it, but it could be viable. The node wrappers I have seen are a bit limited from what I can tell, only wrapping the command line interface, and they seem to force the use of bash (a Linux and OSX limitation). I'll have to try it next time I spin up my Windows VM. I also looked at jspdf, which may be a viable option as well, having per-page operations, but from what I can tell its fromHTML is very limiting. It does have metadata features for the final document, which is nice. I also found html-pdf, which looked promising, but it has not been updated in a while and uses phantomjs. I would prefer to stick with puppeteer as it supports modern HTML features like CSS grids.
Basically, I think we are stuck with either waiting for puppeteer to have this functionality (thus no ToC, figure list, index list, custom numbering, footnotes, per-page custom header/footers, etc.) or make it ourselves which would be a lot of work.
OK, thx. I am coming from WKHTML2PDF and have used this feature before, so I know there is somehow a solution to this. Maybe you also have this solution in mind?
Nevertheless, thx