pandoc
Docx+Citations import fails with multiple sources (Endnote)
Explain the problem.
When importing a docx that has multiple sources combined in one reference, the command pandoc -s test.docx -f docx+citations -o test.json fails with:
Invalid XML:
Missing root element
I am attaching a docx for reproduction. It contains, first, a citation with multiple sources combined and, second, a citation with a single source. As far as I can see at first glance, the multiple sources are contained in the fldData (base64 encoded), while the single source is encoded inside the instrText.
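For anyone who wants to check their own files, here is a rough sketch for seeing which of the two encodings a docx uses. It assumes the main part is word/document.xml and that the WordprocessingML elements carry the usual w: prefix; field_summary is a hypothetical helper name, and the regex scan is a quick diagnostic, not a real XML parse.

```python
import re
import zipfile


def field_summary(docx):
    """Count fldData blocks and collect instrText contents from a docx.

    `docx` may be a path or a binary file-like object.
    Assumption: elements use the conventional `w:` prefix.
    """
    with zipfile.ZipFile(docx) as zf:
        xml = zf.read("word/document.xml").decode("utf-8")
    # Non-greedy match so each instrText element is captured separately.
    instr = re.findall(r"<w:instrText[^>]*>(.*?)</w:instrText>", xml, flags=re.S)
    return {
        "fldData": len(re.findall(r"<w:fldData\b", xml)),
        "instrText": [text.strip() for text in instr],
    }
```

If the summary shows fldData blocks next to ADDIN EN.CITE.DATA instructions, the document most likely contains the combined form described above.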
I have reached out to the publisher to find out the exact Endnote Citation Plugin version that was used to create the document. (edit: EndNote X7.8 (Bld 11583))
Pandoc version?
pandoc 2.19.2 (installed with brew on MacOS (ARM))
Compiled with pandoc-types 1.22.2.1, texmath 0.12.5.2, skylighting 0.13,
citeproc 0.8.0.1, ipynb 0.2, hslua 2.2.1
Scripting engine: Lua 5.4
It's a strange format here; the instrText and the data aren't even in the same node:
<w:r w:rsidR="008138BA">
<w:rPr>
<w:lang w:val="en-US" />
</w:rPr>
<w:fldChar w:fldCharType="begin">
<w:fldData xml:space="preserve">
...base64data...
</w:fldData>
</w:fldChar>
</w:r>
<w:r w:rsidR="008138BA">
<w:rPr>
<w:lang w:val="en-US" />
</w:rPr>
<w:instrText xml:space="preserve">
ADDIN EN.CITE.DATA
</w:instrText>
</w:r>
And there are several of these pairs in a row.
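To make the pairing concrete, here is a minimal sketch of how a reader could match each fldData run with the ADDIN EN.CITE.DATA run that follows it and decode the base64. This is not how pandoc's docx reader is implemented; endnote_field_data is a hypothetical helper, and the payload in SAMPLE is a placeholder, not real EndNote data.

```python
import base64
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def endnote_field_data(xml_text):
    """Return the decoded fldData payloads that are followed by an
    ADDIN EN.CITE.DATA instruction in a later run."""
    root = ET.fromstring(xml_text)
    payloads = []
    pending = None  # base64 text from the most recent fldData run
    for run in root.iter(f"{{{W}}}r"):
        data = run.find("w:fldChar/w:fldData", NS)
        if data is not None:
            # Base64 in docx is usually wrapped, so drop all whitespace.
            pending = "".join((data.text or "").split())
            continue
        instr = run.find("w:instrText", NS)
        if instr is not None and pending is not None:
            if (instr.text or "").strip() == "ADDIN EN.CITE.DATA":
                payloads.append(base64.b64decode(pending))
            pending = None
    return payloads


# Placeholder payload (hypothetical), encoded the same way as fldData.
encoded = base64.b64encode(b"hypothetical payload").decode("ascii")
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"><w:fldData xml:space="preserve">
    {encoded}
  </w:fldData></w:fldChar></w:r>
  <w:r><w:instrText xml:space="preserve"> ADDIN EN.CITE.DATA </w:instrText></w:r>
</w:p>"""

print(endnote_field_data(SAMPLE))
```

The point of the sketch is only that the data and the instruction have to be correlated across sibling runs, which is what makes this layout awkward to parse.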
@jgm could we maybe activate the Zotero and the Endnote reference detection separately? IMHO the Endnote detection is de facto unusable because most documents will contain combined citations, and thus they all need the feature deactivated.
Zotero, however, works great, and I think it's one of the most valuable features added to the docx reader in recent years.
Activating separately would only help if the same document contains both Zotero and EndNote citations. And that's not going to be common, is it?
Otherwise, I'd say: just use +citations for Zotero and don't use it for EndNote.
Activating it separately would allow us to keep using Zotero references and ignore the citations in documents that use EndNote (most of which fail with an error). Without that, we will have to catch the error and then run the conversion again with citations turned off.
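For reference, the catch-and-retry workaround could look roughly like the sketch below. It assumes pandoc is on the PATH and exits non-zero when the import fails; convert_with_fallback and its run parameter are hypothetical names, and the retry triggers on any failure, not only on the "Invalid XML" error.

```python
import subprocess


def convert_with_fallback(src, dst, run=subprocess.run):
    """Try the docx reader with citations first, then without.

    Returns the reader string that succeeded. `run` is injectable so the
    logic can be exercised without pandoc installed.
    """
    result = None
    for reader in ("docx+citations", "docx-citations"):
        result = run(
            ["pandoc", "-s", src, "-f", reader, "-o", dst],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return reader
    # Both attempts failed: surface pandoc's last error message.
    raise RuntimeError(f"pandoc failed for {src}: {result.stderr.strip()}")
```

Called as convert_with_fallback("test.docx", "test.json"), the return value tells you whether citations were parsed or the plain reader had to be used.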
Another possibility, perhaps, is that we could catch the error in pandoc and ignore such cases. Or issue a warning.