mammoth.js

Support page breaks

Open JohnMcLear opened this issue 11 years ago • 15 comments

Any docs for how to support page breaks?

JohnMcLear avatar Dec 09 '13 22:12 JohnMcLear

What sort of behaviour would you expect? Page breaks strike me as being an artefact of printing on paper, which doesn't really apply when translating to HTML. Open to suggestions though.

mwilliamson avatar Dec 10 '13 08:12 mwilliamson

To be honest just whacking in a <span class='pageBreak'></span> would be fine for me.

I'd expect you would want me to use a custom style rule for this. If that's the case, that's fine, just lemme know which style map key to use :)

I use page-break-after:always;page-break-inside:avoid;-webkit-region-break-inside: avoid; to generate the actual page breaks in Etherpad.

JohnMcLear avatar Dec 10 '13 12:12 JohnMcLear

I don't use it, but I happen to know that Dreamweaver would wrap content in <div> tags to mark section-breaks (Page Layout > Page Setup > Breaks). I wouldn't be a fan of this approach as I sometimes use parent-child selectors in my CSS.

I'm not sure that I like the idea of adding classes to the output, @JohnMcLear.

How about adding a simple <hr/> tag?

MCTaylor17 avatar Mar 09 '16 00:03 MCTaylor17

The suggestion of using a custom style mapping is the approach that seems best to me. That way, by default we do nothing, but the user can customise the behaviour to whatever HTML they want.

mwilliamson avatar Mar 09 '16 16:03 mwilliamson

Can you give me an example of how to write a style map that maps page breaks to an hr tag?

swapnil-bawkar avatar Feb 16 '17 14:02 swapnil-bawkar

Page breaks aren't supported at the moment. There's some code to handle them, but that likely requires some more work.

For the technical detail: one way that Word encodes page breaks is as an element within a paragraph. As it works right now, that would result in hr tags within p elements, which likely isn't the desired behaviour. Lifting the breaks up to the top level is likely to give better results.

mwilliamson avatar Feb 16 '17 17:02 mwilliamson
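As an illustration of what "lifting the breaks up to the top level" could mean, here is a minimal post-processing sketch. It is not how mammoth handles breaks internally; `liftBreaks` is a hypothetical helper that works on an HTML string, and it assumes plain `<p>` elements without attributes and no inline element (such as `<em>`) spanning the break.

```javascript
// Hypothetical helper: split any <p> that contains an <hr> into separate
// paragraphs, with the <hr> moved between them at the top level.
// Assumes attribute-free <p> tags and no inline tags spanning the break.
function liftBreaks(html) {
  return html.replace(/<p>([\s\S]*?)<\/p>/g, (whole, inner) => {
    const parts = inner.split(/<hr\s*\/?>/);
    // Paragraphs without a break are returned unchanged.
    if (parts.length === 1) return whole;
    return parts
      // Drop empty fragments so no empty <p></p> is emitted.
      .map((part) => (part.trim() === "" ? "" : `<p>${part}</p>`))
      .join("<hr>");
  });
}
```

A regular expression is used here only to keep the sketch short; an HTML parser would be more robust for real documents.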

I have a use case where customers – wrongly – insert page breaks at the end of pages, and I need to replace them with a space. For that reason it would be good to have a style mapping available that captures (manual) page breaks.

jkorff avatar Feb 19 '19 01:02 jkorff

Having the page breaks would be nice for translating into other formats or for processing the output HTML.

pirtlj avatar Jul 06 '23 22:07 pirtlj

Hi there, and thank you for this awesome lib :)

I'm using mammoth to turn a structured (with specific styles) .docx file into HTML, do some tweaks on it and then use PagedJS to turn it into a PDF to be printed.

In this case the output is in fact paper again, so page breaks do matter.

Could you please consider supporting page breaks?

If you have never stumbled upon this, there is a whole open-source movement (the Coko Foundation) advocating for using HTML as the single source for publishing books and journal papers, using the CSS Paged Media standard to define the layout of the PDF output. This standard hasn't been implemented yet by any of the major browsers, so they built PagedJS, which is in essence a glorified polyfill for the standard. It is already used in production by many publishing houses, and was recently used to produce both a book and a web app for the Louvre in Paris from the same HTML source.

jerefrer avatar Feb 17 '24 10:02 jerefrer

As above, the problem is that it's not obvious (to me, at least!) what the expected behaviour would be, given a page break can occur in the middle of a paragraph.

If you can provide a minimal example document and the expected HTML (especially with mid-paragraph page breaks), then that would help.

mwilliamson avatar Feb 17 '24 13:02 mwilliamson

Here I meant only manual page breaks. It didn't even occur to me that one would want to know about automatic page breaks, when text naturally overflows a page and continues on the next one :)

In the case of manual page breaks, is that already possible? For me it could be either a separate tag or a way to apply a specific CSS class to the first element after the page break. If there is already a way to do this, maybe adding it to the docs wouldn't hurt :)

jerefrer avatar Feb 19 '24 10:02 jerefrer

There's some support for breaks, but it is intentionally undocumented since it's still subject to change.

Could you provide a minimal example document and the expected HTML?

mwilliamson avatar Feb 19 '24 18:02 mwilliamson

Alright so here's a very simple example .docx file: example.docx

What I'd like to get back would be either this:

<p>This content is on page one.</p>
<hr>
<p>This one on page two.</p>
<p><em>And it has</em></p>
<h1>Some more content to it</h1>
<h2>With a few styles.</h2>
<hr>
<p>This is page three.</p>

or something like that:

<p>This content is on page one.</p>
<p class="break-before">This one on page two.</p>
<p><em>And it has</em></p>
<h1>Some more content to it</h1>
<h2>With a few styles.</h2>
<hr>
<p class="break-before">This is page three.</p>

jerefrer avatar Feb 19 '24 18:02 jerefrer
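For the class-based variant, one possible route is to post-process HTML that already uses `<hr>` separators. The sketch below is only an illustration: `markBreakBefore` is a hypothetical helper, the `break-before` class name comes from the example above, and the code assumes the element following each `<hr>` is an attribute-free `<p>` or heading.

```javascript
// Hypothetical helper: replace each <hr> with a "break-before" class on
// the element that follows it. Only handles attribute-free <p> and
// <h1>..<h6> tags directly after the <hr> (whitespace allowed between).
function markBreakBefore(html) {
  return html.replace(
    /<hr\s*\/?>\s*<(p|h[1-6])>/g,
    '<$1 class="break-before">'
  );
}
```

An `<hr>` that is not followed by one of those elements is left in place, so nothing is silently lost.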

I think you can already use a style map along the lines of:

br[type='page'] => hr

to get what you want, but be warned that the exact syntax and behaviour might change in the future!

mwilliamson avatar Feb 19 '24 18:02 mwilliamson
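For readers wondering how that style map line is passed to mammoth, a minimal sketch follows. It uses mammoth's documented `convertToHtml` function and `styleMap` option; the helper names (`buildOptions`, `convertWithPageBreaks`) and the file name are assumptions made for illustration, and the `br[type='page']` matcher itself remains undocumented and subject to change, as noted above.

```javascript
// Style map rule quoted from the comment above. The `br[type='page']`
// matcher is undocumented and may change between mammoth releases.
const PAGE_BREAK_RULE = "br[type='page'] => hr";

// Build the options object passed to mammoth, putting the page-break
// rule ahead of any additional rules supplied by the caller.
function buildOptions(extraRules = []) {
  return { styleMap: [PAGE_BREAK_RULE, ...extraRules] };
}

// Convert a .docx file to HTML. `mammoth` is passed in as an argument
// so the helper can be exercised without the package installed.
async function convertWithPageBreaks(mammoth, docxPath) {
  const result = await mammoth.convertToHtml({ path: docxPath }, buildOptions());
  // result.messages holds any warnings raised during conversion.
  return { html: result.value, messages: result.messages };
}

// Usage (assumes `npm install mammoth` and a local example.docx):
// const mammoth = require("mammoth");
// convertWithPageBreaks(mammoth, "example.docx")
//   .then(({ html }) => console.log(html));
```

Keeping the rule in a single constant means there is one place to update if the syntax changes.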

It's working 🎉 If it starts breaking one day I'll know where to look :) Thanks!

jerefrer avatar Feb 19 '24 19:02 jerefrer