docx icon indicating copy to clipboard operation
docx copied to clipboard

Corrupt Word document from patching with an XML attribute with an ampersand

Open joshkel opened this issue 1 year ago • 3 comments

We're using the patcher feature to patch a Word document and discovered that ampersands in the XML attributes of the source document result in a corrupt Word document ("Word experienced an error trying to open the file") output.

To reproduce:

  1. Open simple-template.docx.
  2. Add a drop-down list with a "&" in one of the list items.
  3. Run the 85-template-document.ts example.

The problem is caused by inconsistent handling of XML character entities within xml-js; it decodes the & within xml-js's xml2js but does not re-encode it in xml-js's js2xml. This is apparently a known issue; see https://github.com/nashwaan/xml-js/issues/69.

I noticed that that issue has been open for a while, and xml-js hasn't had a release for a while. Maybe it would be worth an alternative implementation, such as the more popular xml2js? Alternative, docx could use an attributeValueFn, as described here.

Sample code demonstrating the issue at a lower level:

const x1 = require('xml-js');
const x2 = require('xml2js');

const xmlTests = [
  '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a attr="This &amp; That">a</a>',
  '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a attr="This &lt; That">a</a>',
  '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a attr="&quot;Hello&quot;">a</a>',
  '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><a>&lt;</a>',
];

async function main() {
  const builder = new x2.Builder({ renderOpts: { pretty: false, newline: '' } });
  for (const xml of xmlTests) {
    console.log('Original: ' + xml);
    console.log('xml-js  : ' + x1.js2xml(x1.xml2js(xml)));

    const parsed = await x2.parseStringPromise(xml);
    console.log('xml2js  : ' + builder.buildObject(parsed));

    console.log();
  }
}

main();

joshkel avatar Dec 08 '23 19:12 joshkel

I like xml-js because it's bi-directional

xml2js is one way, which is not too appealing. docx would then need another library to do it the other way, which would bloat

dolanmiu avatar Dec 31 '23 22:12 dolanmiu

xml2js is one way, which is not too appealing.

Unless I'm misunderstanding, xml2js is bidirectional - it supports creating XML from JavaScript objects. (I haven't dug into the APIs and docx's usage of xml-js, so there are specific features missing.)

joshkel avatar Jan 01 '24 00:01 joshkel

Appologies, yes it does do bi-directional. It wasn't very clear in the README

Hmm, I had a look at it. It wouldn't be a simple drop in replacement as the JSON looks quite different than xml-js

But it might be worth it to switch to another library as xml-js might be dead

But on the flip side, I am hestitant to switch to another library because it would involve a lot of work, and who knows, xml2js has its cons too, many issues unanswered, and may pull requests not merged. I feel like replacing one with another would remove problems, but it would also introduce new problems...

dolanmiu avatar Jan 01 '24 01:01 dolanmiu