pandoc
docx to html ignores "style separator"
Explain the problem.
- Use Microsoft Word to create a docx with a "Style Separator" (Ctrl+Alt+Enter) which puts two paragraphs on one line
- Convert to HTML with
pandoc -o out.html in.docx
Result: HTML file has paragraphs on separate lines.
Expected: Style separated paragraphs are on the same line in the HTML file.
Example Files
Input: in.docx
Output:
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<p>Paragraph 4</p>
Additional Information
The "style separator" is stored in the docx as:
<w:pPr>
  <w:rPr>
    <w:vanish />
    <w:specVanish />
  </w:rPr>
</w:pPr>
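To make the marker concrete, here is a minimal Python sketch (standard library only) that flags paragraphs whose paragraph mark carries both `w:vanish` and `w:specVanish`. The function name `has_style_separator` and the inline sample XML are hypothetical, made up for illustration; the sketch also assumes the two elements are simply present, and ignores the case where they are switched off with a `w:val` attribute.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def has_style_separator(paragraph: ET.Element) -> bool:
    """True if the paragraph mark has both w:vanish and w:specVanish.

    Hypothetical helper; simplification: presence only, w:val is not checked.
    """
    rpr = paragraph.find("w:pPr/w:rPr", NS)
    if rpr is None:
        return False
    return (rpr.find("w:vanish", NS) is not None
            and rpr.find("w:specVanish", NS) is not None)


# Hand-written stand-in for the body of in.docx (assumption, not the real file)
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
  <w:p><w:r><w:t>Paragraph 1</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:rPr><w:vanish/><w:specVanish/></w:rPr></w:pPr>
    <w:r><w:t>Paragraph 2</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Paragraph 3</w:t></w:r></w:p>
  <w:p><w:r><w:t>Paragraph 4</w:t></w:r></w:p>
</w:body>"""

body = ET.fromstring(SAMPLE)
flags = [has_style_separator(p) for p in body.findall("w:p", NS)]
print(flags)  # [False, True, False, False]
```

For a real file, the XML would come from `word/document.xml` inside the .docx zip archive rather than from an inline string.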
Pandoc version?
Tested with https://pandoc.org/try/ and locally with pandoc.exe 3.1.3
Pandoc is really only concerned with parsing the structure of the document. It can't represent the concept of "two paragraphs, but on the same line." (I'm not even sure how you'd do it in HTML, though I suppose it's possible with CSS trickery.) So, I'd say this is out of scope.
Thanks for your comment. How would you define "the structure of the document"? For the real-world examples that I've seen so far, it would make sense to simply merge the paragraphs.
Expected output:
<p>Paragraph 1</p>
<p>Paragraph 2 Paragraph 3</p>
<p>Paragraph 4</p>
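The merge I have in mind can be sketched in a few lines of Python. This is only an illustration of the expected behavior, not how pandoc's docx reader works: the helper names, the inline sample XML, and the choice to join merged paragraphs with a single space are all my assumptions.

```python
import html
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def has_style_separator(p: ET.Element) -> bool:
    """Hypothetical helper: paragraph mark has both w:vanish and w:specVanish."""
    rpr = p.find("w:pPr/w:rPr", NS)
    return (rpr is not None
            and rpr.find("w:vanish", NS) is not None
            and rpr.find("w:specVanish", NS) is not None)


def paragraph_text(p: ET.Element) -> str:
    """Concatenate the text of all w:t runs in a paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))


def merge_style_separated(body: ET.Element) -> list[str]:
    """Join each style-separated paragraph with the paragraph that follows it."""
    merged, pending = [], []
    for p in body.findall("w:p", NS):
        pending.append(paragraph_text(p))
        if not has_style_separator(p):
            # Assumption: a single space between the joined paragraphs
            merged.append(" ".join(pending))
            pending = []
    if pending:  # document ended on a style separator
        merged.append(" ".join(pending))
    return merged


# Hand-written stand-in for the body of in.docx (assumption, not the real file)
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
  <w:p><w:r><w:t>Paragraph 1</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:rPr><w:vanish/><w:specVanish/></w:rPr></w:pPr>
    <w:r><w:t>Paragraph 2</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Paragraph 3</w:t></w:r></w:p>
  <w:p><w:r><w:t>Paragraph 4</w:t></w:r></w:p>
</w:body>"""

paragraphs = merge_style_separated(ET.fromstring(SAMPLE))
for text in paragraphs:
    print(f"<p>{html.escape(text)}</p>")
```

On this sample it prints the three `<p>` lines shown above. Real documents may already contain their own spacing at the join, so the single-space separator is a guess that would need checking against actual files.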
We could represent it that way, I suppose. I tried Ctrl+Alt+Enter in Word and it just put me on the next line, so I haven't been able to see what this feature is about (nor have I ever heard of it). What is it used for?
How to use it in Word: First create two paragraphs. Then put the cursor on the first paragraph and hit Ctrl+Alt+Enter to join them.
I know one use case for this feature: Shorten figure captions in the table of figures. Example: table-of-tables.docx
> How to use it in Word: First create two paragraphs. Then put the cursor on the first paragraph and hit Ctrl+Alt+Enter to join them.
Didn't work for me (Word 16.79.1 for Mac).
Good to know, and thanks for testing.
For the record, this is how other programs display my table-of-tables.docx example:
- LibreOffice on Linux: merges the paragraphs
- Pages on Mac: merges the paragraphs
- Word on the Web (Microsoft 365 Online): does not merge the paragraphs ;-)
I still think that it would be very reasonable if pandoc also merges those paragraphs.