stencil
pandoc cannot parse rendered document
Pandoc works for templates but not for rendered documents:
[erdos@localhost stencil]$ pandoc --verbose -f docx -t markdown /tmp/tmp.nHI5VewqgZ
couldn't parse docx file
It doesn't work for -t pdf either. @erdos, any idea where the problem could lie? Not sure I can help, but I could give it a try.
Hello, I am not sure at this point.
My initial thought was that stencil may be using different XML aliases than what we have in the source docx file. (A docx is just a zip file with a bunch of XML files in it.) We can test it by diffing the template against the rendered document to see the changes made by stencil.
Would this be the word/document.xml file? I'll diff mine and see what is happening.
The diff is massive, because the original XML is two lines, whereas Stencil's is a single line. The original starts with
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document
Stencil starts with
<?xml version="1.0" encoding="UTF-8"?><a:document ...
That's already different; the rest is hard to see because vimdiff essentially says everything is different 😅 I'll try recompressing the Stencil version with the same header and see what happens.
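A diff of the raw files is unreadable when one side is a single line, so it may help to pretty-print both copies of word/document.xml first. The following is only a sketch under that assumption; the helper names `pretty_part` and `diff_docx` are made up for illustration and are not part of Stencil or pandoc.

```python
import difflib
import zipfile
from xml.dom import minidom


def pretty_part(docx_path, part="word/document.xml"):
    """Read one part out of a .docx (a zip archive) and return it as indented lines."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    # Re-indent so that every element sits on its own line.
    return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_docx(template_path, rendered_path, part="word/document.xml"):
    """Return a unified diff of the same part in two .docx files."""
    return list(
        difflib.unified_diff(
            pretty_part(template_path, part),
            pretty_part(rendered_path, part),
            fromfile=template_path,
            tofile=rendered_path,
            lineterm="",
        )
    )
```

Printing `"\n".join(diff_docx("template.docx", "rendered.docx"))` (both file names are placeholders) gives a line-oriented diff. Renamed namespace prefixes will still show up on almost every line, which is itself a useful hint.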
Results: changing this leaves me with an invalid file, so I'm doing something wrong. How do you repack the separate files into a .docx? I was just zipping it.
Results: changing this leaves me with an invalid file, so I'm doing something wrong. How do you repack the separate files into a .docx? I was just zipping it
Just zipping the files again should work if you keep the original file paths and names.
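For reference, a minimal sketch of such a repack in Python, assuming the .docx was unpacked into a directory. The function name `repack_docx` is hypothetical. The point it illustrates is that entries are stored relative to the unpacked directory, so [Content_Types].xml stays at the archive root rather than under an extra top-level folder.

```python
import os
import zipfile


def repack_docx(src_dir, out_path):
    """Zip an unpacked .docx directory back into a single file.

    out_path should live outside src_dir, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Keep the original relative path and name, with forward
                # slashes, e.g. "word/document.xml".
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                archive.write(full, rel)
```

Listing the result with `zipfile.ZipFile(out_path).namelist()` is a quick way to confirm that the paths match those in the original file.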
I had messed up some tags, that's why it wasn't working. In any case, pandoc still refuses to compile it to pdf (or whatever else). Stencil is definitely doing a bit more than just replacing the templating text. Again, it's hard to do a good diff, because the order of arguments and their tags, everything is different. The start of my documents:
Original template
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" [... other xmlns ..]
Stencil's
<?xml version="1.0" encoding="UTF-8"?> <w:document xmlns:ign26758="http://schemas.microsoft.com/office/word/2010/wordml"
Notice:
- no standalone
- no line break
- different order
- the wordml schema does not have xmlns:ign26758, instead it's xmlns:w14
No idea if any of these make a difference, since the resulting doc still opens in Word without any complaints. AFAIK these are just minor tags, which shouldn't affect pandoc, but I'm not an expert. Do you know of any resources I could explore to understand the tags?
the wordml schema does not have xmlns:ign26758, instead it's xmlns:w14
I think this will be the key here. Many ooxml readers expect specific xml alias names (even though it should not affect how the xml is parsed). It sounds very much like #55 and #56 and #97.
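One way to check this without reading a diff is to list which prefix each namespace URI is bound to in both files and report the URIs whose prefix changed. This is only a sketch; `namespace_prefixes` and `renamed_prefixes` are hypothetical helper names, and it says nothing about how the fix in Stencil itself is implemented.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def namespace_prefixes(docx_path, part="word/document.xml"):
    """Map each namespace URI declared in the part to the set of prefixes bound to it."""
    with zipfile.ZipFile(docx_path) as archive:
        raw = archive.read(part)
    bindings = {}
    # "start-ns" events report every xmlns declaration as (prefix, uri).
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(raw), events=("start-ns",)):
        bindings.setdefault(uri, set()).add(prefix)
    return bindings


def renamed_prefixes(template_path, rendered_path, part="word/document.xml"):
    """Return {uri: (template_prefixes, rendered_prefixes)} for URIs whose prefixes differ."""
    before = namespace_prefixes(template_path, part)
    after = namespace_prefixes(rendered_path, part)
    return {
        uri: (before[uri], after[uri])
        for uri in before.keys() & after.keys()
        if before[uri] != after[uri]
    }
```

For the two documents quoted above, the wordml URI would be expected to appear in the result with w14 on the template side and ign26758 on the rendered side.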
Do you know of any resources I could explore to understand the tags?
I use http://officeopenxml.com/WPdocument.php as a reference for OOXML; however, it does not cover the nuances of LO, Word, or Pandoc.
Fixed in commit https://github.com/erdos/stencil/commit/0314df26e5f42ec4ed29d51b0a6ebd22bd58c382 to be released in 0.5.1 soon.