vitepress-export-pdf

Merge PDF page number error

Open aMagicalpole opened this issue 2 years ago • 2 comments

Question:

The page numbers are not merged. Is there any way to combine the page numbers or customize them? Thank you very much.

Code:

const footerTemplate = `<div style="margin-bottom: -0.4cm; height: 70%; width: 100%; display: flex; justify-content: space-between; align-items: center; color: lightgray; border-top: solid lightgray 1px; font-size: 10px;">
	<span style="margin-left: 15px;" class="url"></span><span style="margin-right: 15px;"><span class="pageNumber"></span>/<span class="totalPages"></span></span>
</div>`;

aMagicalpole avatar Jun 08 '23 02:06 aMagicalpole

The PDF file format is all about producing the desired visual result for printing. It was not created for parsing the content. PDF files don’t contain a semantic layer.

Specifically, there is no information about which parts are the header, footer, page numbers, tables, or paragraphs. The visual appearance is there, and people might find heuristics to make educated guesses, but there is no way of being certain.

This is a shortcoming of the PDF file format. (Quoted from the pypdf documentation: https://pypdf.readthedocs.io/en/stable/user/extract-text.html#missing-semantic-layer)

The page description language used in PDF is loosely comparable to HTML in that it describes what appears on a page, but it carries no semantics. It describes the content of PDF pages through objects, as in the following example:

3 0 obj
<< /Filter /FlateDecode /Length 191 >>
stream
[191 bytes of Flate-compressed binary data, not printable]
endstream
endobj
1 0 obj
<< /Type /Page /Parent 2 0 R /Resources 4 0 R /Contents 3 0 R /MediaBox [0 0 595.28 841.89]
>>
endobj
4 0 obj
<< /ProcSet [ /PDF /Text ] /ColorSpace << /Cs1 5 0 R >> /Font << /TT1 6 0 R
>> >>
endobj

This is why many PDF parsing libraries cannot extract page numbers, and why I cannot modify the page numbers when merging PDFs. I have thought about this for a long time without finding a solution.
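To make the point concrete, here is a minimal sketch of what a decoded content stream can look like and why a page number is hard to identify in it. The stream text is hypothetical (it is not the decoded content of object 3 above), and the helper names `extractTextRuns` and `guessPageNumber` are made up for illustration. Real streams are often harder to read, since text may be stored as hex strings or glyph arrays.

```javascript
// Hypothetical decoded content stream: two text runs drawn with the same
// operators. Nothing marks the second one as a page number.
const decodedStream = [
	"BT /TT1 12 Tf 72 770 Td (Getting Started) Tj ET",
	"BT /TT1 9 Tf 545 14 Td (3/12) Tj ET",
].join("\n");

// Collect every literal string shown with the Tj operator, plus its position.
function extractTextRuns(stream) {
	const runs = [];
	const pattern = /([\d.]+) ([\d.]+) Td \(([^)]*)\) Tj/g;
	for (const match of stream.matchAll(pattern)) {
		runs.push({ x: Number(match[1]), y: Number(match[2]), text: match[3] });
	}
	return runs;
}

// Heuristic guess only: text near the bottom edge that looks like "n/m".
function guessPageNumber(runs, bottomMargin = 40) {
	return runs.find((run) => run.y < bottomMargin && /^\d+\s*\/\s*\d+$/.test(run.text));
}

const runs = extractTextRuns(decodedStream);
console.log(guessPageNumber(runs));
```

The guess works here only because the sample was written to match it. A date such as "3/12" in a footer, or body text that happens to sit near the bottom of the page, would be picked up the same way.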

However, there is an imperfect workaround: turn off page numbers when generating the PDF, but leave room for them in the footer and add them yourself afterwards. Here is an example: https://github.com/condorheroblog/vitepress-export-pdf/commit/d26383d09313ebd3a009cee110429ada1aaed1d4#diff-79cab662fb8d5527d226a743033ffdfd879fcb65489faa6eabe35ca25a7906d5
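As a sketch of the "leave room" step, the footer from the question could be reused with the counter spans removed. The constant name `footerTemplateWithoutCounters` is hypothetical, and this may differ from what the linked commit does.

```javascript
// Variant of the footer from the question: the border and URL stay, but the
// page counter spans are removed, so the right-hand side is left blank.
const footerTemplateWithoutCounters = `<div style="margin-bottom: -0.4cm; height: 70%; width: 100%; display: flex; justify-content: space-between; align-items: center; color: lightgray; border-top: solid lightgray 1px; font-size: 10px;">
	<span style="margin-left: 15px;" class="url"></span><span style="margin-right: 15px;"></span>
</div>`;

console.log(footerTemplateWithoutCounters.includes("pageNumber")); // false
```

Keeping the same height and margins means the footer still occupies the same space, so numbers drawn later land in the gap.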

// Run as an ES module (for example a .mjs file), because the script uses top-level await.
import { readFileSync, writeFileSync } from "node:fs";
import { PDFDocument, StandardFonts, rgb } from "pdf-lib";

// Load the PDF that was generated without page numbers.
const existingPdfBytes = readFileSync("./vitepress.dev.pdf");
const pdfDoc = await PDFDocument.load(existingPdfBytes);
const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);

const pages = pdfDoc.getPages();
const totalPages = pages.length;

// Draw "current / total" near the bottom-right corner of every page.
for (let i = 0; i < totalPages; i++) {
	const page = pages[i];
	const { width } = page.getSize();
	const text = `${i + 1} / ${totalPages}`;
	const fontSize = 9;
	const textX = width - 50;
	const textY = fontSize;
	page.drawText(text, {
		x: textX,
		y: textY + 5,
		size: fontSize,
		font: helveticaFont,
		// Mid-gray; rgb() expects each channel in the range 0 to 1.
		color: rgb(127 / 255, 127 / 255, 127 / 255),
	});
}

// Write the numbered copy to a new file and leave the input untouched.
const pdfBytes = await pdfDoc.save();
writeFileSync("pagination.pdf", pdfBytes);

It's not perfect, but it's good.
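One possible refinement: the script above places the label at a fixed `width - 50`, so longer labels such as "128 / 1024" extend further to the right than short ones. A minimal sketch of right-aligning the label is shown below. The helper names `rightAlignedX` and `approximateTextWidth` are hypothetical, and the width estimate is a rough stand-in so the example runs without pdf-lib.

```javascript
// Hypothetical helper: x coordinate that right-aligns a label of the given
// rendered width, leaving `margin` points free at the right page edge.
function rightAlignedX(pageWidth, textWidth, margin = 15) {
	return pageWidth - margin - textWidth;
}

// Stand-in for a real font measurement so the sketch runs on its own.
// Assumes an average glyph width of half the font size, which is only a rough guess.
function approximateTextWidth(text, fontSize) {
	return text.length * fontSize * 0.5;
}

const pageWidth = 595.28; // A4 width in points, as in the MediaBox shown earlier
for (const label of ["1 / 9", "128 / 1024"]) {
	const x = rightAlignedX(pageWidth, approximateTextWidth(label, 9));
	console.log(label, x.toFixed(2));
}
```

In the script above, the estimate can be replaced with the real measurement, `helveticaFont.widthOfTextAtSize(text, fontSize)`, and the result passed as `x` to `page.drawText`.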

condorheroblog avatar Jun 08 '23 03:06 condorheroblog

I know that Cairo at least supports page labels. Perhaps pdf-lib does as well?

asinghvi17 avatar Apr 23 '24 21:04 asinghvi17