docx to html ignores "style separator"

Open speleo3 opened this issue 1 year ago • 6 comments

Explain the problem.

  1. Use Microsoft Word to create a docx with a "Style Separator" (Ctrl+Alt+Enter) which puts two paragraphs on one line
  2. Convert to HTML with pandoc -o out.html in.docx

Result: HTML file has paragraphs on separate lines.

Expected: Style separated paragraphs are on the same line in the HTML file.

Example Files

Input: in.docx

Output:

<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
<p>Paragraph 4</p>

Additional Information

The "style separator" is stored in the docx as:

      <w:pPr>
        <w:rPr>
          <w:vanish />
          <w:specVanish />
        </w:rPr>
      </w:pPr>
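A minimal sketch of how that marker could be detected, assuming the paragraph properties look exactly like the fragment above. The helper name `has_style_separator` and the inline sample XML are made up for illustration; this is not pandoc's code.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def has_style_separator(paragraph):
    """Return True if a <w:p> element has <w:specVanish/> in its paragraph-mark run properties."""
    return paragraph.find("w:pPr/w:rPr/w:specVanish", NS) is not None


# Hypothetical sample mirroring the four-paragraph example in this issue;
# only "Paragraph 2" ends with a style separator.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Paragraph 1</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:rPr><w:vanish/><w:specVanish/></w:rPr></w:pPr>
    <w:r><w:t>Paragraph 2</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Paragraph 3</w:t></w:r></w:p>
  <w:p><w:r><w:t>Paragraph 4</w:t></w:r></w:p>
</w:body>
"""

body = ET.fromstring(SAMPLE.strip())
flags = [has_style_separator(p) for p in body.findall("w:p", NS)]
print(flags)  # [False, True, False, False]
```

For a real file, the same check could be run on the XML read with `zipfile.ZipFile("in.docx").read("word/document.xml")`, assuming the main document part is stored under that usual name.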

Pandoc version?

Tested with https://pandoc.org/try/ and locally with pandoc.exe 3.1.3

speleo3 avatar Nov 29 '23 12:11 speleo3

Pandoc is really only concerned with parsing the structure of the document. It can't represent the concept of "two paragraphs, but on the same line." (I'm not even sure how you'd do it in HTML, though I suppose it's possible with CSS trickery.) So, I'd say this is out of scope.

jgm avatar Nov 29 '23 16:11 jgm

Thanks for your comment. How would you define "the structure of the document"? For the real-world examples that I've seen so far, it would make sense to simply merge the paragraphs.

Expected output:

<p>Paragraph 1</p>
<p>Paragraph 2 Paragraph 3</p>
<p>Paragraph 4</p>
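The merge proposed here can be sketched roughly as follows. This is only an illustration of the idea, not how pandoc's docx reader works: the function names and sample XML are hypothetical, and joining the merged texts with a single space is an assumption chosen to match the expected output above.

```python
import xml.etree.ElementTree as ET
from html import escape

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(p):
    """Concatenate the text of all <w:t> runs inside one <w:p>."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


def merged_paragraphs(body):
    """Merge each style-separated paragraph with the paragraph that follows it."""
    merged, pending = [], []
    for p in body.findall("w:p", NS):
        pending.append(paragraph_text(p))
        # A paragraph without <w:specVanish/> ends the current visual line.
        if p.find("w:pPr/w:rPr/w:specVanish", NS) is None:
            merged.append(" ".join(pending))  # assumption: join with one space
            pending = []
    if pending:  # document ended on a style separator
        merged.append(" ".join(pending))
    return merged


# Hypothetical sample: "Paragraph 2" ends with a style separator.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Paragraph 1</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:rPr><w:vanish/><w:specVanish/></w:rPr></w:pPr>
    <w:r><w:t>Paragraph 2</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Paragraph 3</w:t></w:r></w:p>
  <w:p><w:r><w:t>Paragraph 4</w:t></w:r></w:p>
</w:body>
"""

body = ET.fromstring(SAMPLE.strip())
result = merged_paragraphs(body)
print("\n".join(f"<p>{escape(text)}</p>" for text in result))
```

A real implementation would also have to decide which paragraph style the merged paragraph keeps; the sketch ignores styles entirely.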

speleo3 avatar Nov 29 '23 21:11 speleo3

We could represent it that way, I suppose. I tried Ctrl+Alt+Enter in Word and it just put me on the next line, so I haven't been able to see what this feature is all about (nor had I ever heard of it). What is it used for?

jgm avatar Nov 29 '23 22:11 jgm

How to use it in Word: First create two paragraphs. Then put the cursor on the first paragraph and hit Ctrl+Alt+Enter to join them.

I know one use case for this feature: Shorten figure captions in the table of figures. Example: table-of-tables.docx

[Screenshot: table-of-tables.docx in Word]

speleo3 avatar Nov 30 '23 15:11 speleo3

> How to use it in Word: First create two paragraphs. Then put the cursor on the first paragraph and hit Ctrl+Alt+Enter to join them.

Didn't work for me (Word 16.79.1 for Mac).

jgm avatar Nov 30 '23 16:11 jgm

> Didn't work for me (Word 16.79.1 for Mac).

Good to know, and thanks for testing.

For the record, this is how other programs display my table-of-tables.docx example:

  • LibreOffice on Linux: merges the paragraphs
  • Pages on Mac: merges the paragraphs
  • Word on the Web (Microsoft 365 Online): does not merge the paragraphs ;-)

I still think that it would be very reasonable if pandoc also merges those paragraphs.

speleo3 avatar Dec 01 '23 07:12 speleo3