pandoc cannot parse rendered document

Open erdos opened this issue 3 years ago • 7 comments

Pandoc works for templates but not for rendered documents:

[erdos@localhost stencil]$ pandoc --verbose  -f docx -t markdown  /tmp/tmp.nHI5VewqgZ
couldn't parse docx file

erdos avatar Aug 15 '21 19:08 erdos

Doesn't work for -t pdf either. @erdos any idea where the problem could lie? Not sure I can help, but I could give it a try.

jcpsantiago avatar Feb 08 '22 10:02 jcpsantiago

Hello, I am not sure at this point.

My initial thought was that Stencil may be using different XML aliases (namespace prefixes) than the ones in the source docx file. (A docx is just a zip file with a bunch of XML files in it.) We can test this by diffing the template against the rendered document to see the changes made by Stencil.
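A minimal sketch of such a diff, assuming both files are ordinary zip archives and that `word/document.xml` is the part of interest. The function names and paths are made up for illustration. Each part is pretty-printed first so that a single-line XML file does not show up as one giant changed line:

```python
import difflib
import zipfile
from xml.dom import minidom


def read_part(docx_path, part="word/document.xml"):
    """Return one XML part of a .docx, pretty-printed as a list of lines."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    # Re-indenting puts every element on its own line, so the diff is line-based.
    return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_docx(template_path, rendered_path, part="word/document.xml"):
    """Unified diff of the same part taken from the template and the rendered file."""
    return list(difflib.unified_diff(
        read_part(template_path, part),
        read_part(rendered_path, part),
        fromfile="template/" + part,
        tofile="rendered/" + part,
        lineterm="",
    ))
```

Calling `print("\n".join(diff_docx("template.docx", "rendered.docx")))` with your own file paths prints the differences. Attributes of one element still end up on a single line, so a reordered root element shows as one changed line rather than being hidden.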

erdos avatar Feb 08 '22 11:02 erdos

Would this be the word/document.xml file? I'll diff mine and see what is happening

jcpsantiago avatar Feb 08 '22 12:02 jcpsantiago

The diff is massive, because the original XML is two lines, whereas Stencil's is a single line. The original starts with

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document

Stencil starts with

<?xml version="1.0" encoding="UTF-8"?><a:document ...

That's already different. The rest is hard to see because vimdiff essentially says everything is different 😅 I'll try recompressing the Stencil version with the same header and see what happens.

Results: changing this leaves me with an invalid file, so I'm doing something wrong. How do you repack the separate files into a .docx? I was just zipping it.

jcpsantiago avatar Feb 08 '22 13:02 jcpsantiago

Results: changing this leaves me with an invalid file, so I'm doing something wrong. How do you repack the separate files into a .docx? I was just zipping it.

Just zipping the files again should work if you keep the original file paths and names.
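For illustration, a small sketch of that repacking step, assuming the docx was unpacked into a directory and edited in place. `repack_docx` is a made-up helper name. The point it shows is that entry names must be relative to the unpacked root (for example `word/document.xml`), which is easy to get wrong when zipping the parent folder instead:

```python
import os
import zipfile


def repack_docx(source_dir, output_path):
    """Zip an unpacked .docx tree, naming entries by their path relative to source_dir."""
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Keep the original layout: "word/document.xml",
                # not "unpacked/word/document.xml".
                entry = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                archive.write(full_path, entry)
```

Write `output_path` somewhere outside `source_dir`, otherwise the archive being written would be picked up by the directory walk.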

erdos avatar Feb 10 '22 12:02 erdos

I had messed up some tags, that's why it wasn't working. In any case, pandoc still refuses to convert it to PDF (or anything else). Stencil is definitely doing a bit more than just replacing the templating text. Again, it's hard to do a good diff, because everything is different, including the order of the attributes and their tags. Here is the start of my documents. Original template:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" [... other xmlns ..]

Stencil's

<?xml version="1.0" encoding="UTF-8"?> <w:document xmlns:ign26758="http://schemas.microsoft.com/office/word/2010/wordml" 

Notice:

  • no standalone
  • no line break
  • different order
  • the wordml schema does not have xmlns:ign26758, instead it's xmlns:w14

No idea if any of these make a difference, since the resulting doc still opens in Word without any complaints. AFAIK these are just minor tags, which shouldn't affect pandoc, but I'm not an expert. Do you know of any resources I could explore to understand the tags?

jcpsantiago avatar Feb 12 '22 19:02 jcpsantiago

the wordml schema does not have xmlns:ign26758, instead it's xmlns:w14

I think this will be the key here. Many ooxml readers expect specific xml alias names (even though it should not affect how the xml is parsed). It sounds very much like #55 and #56 and #97.
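One way to check this hypothesis is to list which prefix each file binds to each namespace URI and report the ones that changed. This is only a sketch with made-up function names. It says nothing about which prefixes pandoc actually requires:

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def namespace_prefixes(docx_path, part="word/document.xml"):
    """Map each namespace URI declared in the part to the first prefix bound to it."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    prefixes = {}
    # "start-ns" events yield (prefix, uri) for every xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(raw), events=("start-ns",)):
        prefixes.setdefault(uri, prefix)
    return prefixes


def renamed_prefixes(template_path, rendered_path, part="word/document.xml"):
    """Return {uri: (template_prefix, rendered_prefix)} where the prefixes differ."""
    before = namespace_prefixes(template_path, part)
    after = namespace_prefixes(rendered_path, part)
    return {uri: (before[uri], after[uri])
            for uri in before.keys() & after.keys()
            if before[uri] != after[uri]}
```

For the documents described above, the wordml URI would be expected to show up with `w14` on the template side and `ign26758` on the rendered side.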

Do you know of any resources I could explore to understand the tags?

I use http://officeopenxml.com/WPdocument.php as a reference for OOXML. However, it does not cover the nuances of LibreOffice, Word, or Pandoc.

erdos avatar Feb 15 '22 09:02 erdos

Fixed in commit https://github.com/erdos/stencil/commit/0314df26e5f42ec4ed29d51b0a6ebd22bd58c382 to be released in 0.5.1 soon.

erdos avatar Dec 20 '22 18:12 erdos