mammoth.js
retain content control tags during conversion
If you choose "Save As" and then "Web Page (.html)" in Word, the resulting HTML retains some of the content control information, for example:
<span style="font-size: 10pt; font-family: 'Arial', sans-serif">
<w:Sdt
DocPart="80CD2684909242928B858862BDF7732B"
Text="t"
Title="full_name"
SdtTag="full_name"
ID="1185485261"
>
David Warner Roy
</w:Sdt>
<span style="mso-tab-count: 2"></span>
</span>
Is it possible to retain this info, particularly the SdtTag (it might be called something else in the Word XML)? If I'm reading it correctly, Mammoth currently grabs only the content of the sdt tags:
https://github.com/mwilliamson/mammoth.js/blob/52ec8fbfc6da3695d98e83e222232aaa3b1dcf43/lib/docx/body-reader.js#L325-L327
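To make the request concrete, here is a rough sketch of reading the tag alongside the content. It assumes the usual WordprocessingML layout, where w:sdt holds a w:sdtPr element (with w:tag and w:alias children) next to w:sdtContent. The plain-object tree and the firstChild and readSdt helpers are hypothetical stand-ins for illustration, not Mammoth's internal API.

```javascript
// Hypothetical stand-in for a parsed <w:sdt> element.
// This object shape is an assumption for illustration, not Mammoth's own XML model.
const sdt = {
  name: "w:sdt",
  children: [
    {
      name: "w:sdtPr",
      children: [
        { name: "w:alias", attributes: { "w:val": "full_name" }, children: [] },
        { name: "w:tag", attributes: { "w:val": "full_name" }, children: [] },
      ],
    },
    {
      name: "w:sdtContent",
      children: [{ name: "text", value: "David Warner Roy" }],
    },
  ],
};

// Return the first direct child with the given element name, or undefined.
function firstChild(element, name) {
  return (element.children || []).find((child) => child.name === name);
}

// Keep the content control's tag next to its content instead of
// returning only the children of w:sdtContent.
function readSdt(element) {
  const properties = firstChild(element, "w:sdtPr");
  const tag = properties && firstChild(properties, "w:tag");
  const content = firstChild(element, "w:sdtContent");
  return {
    tag: tag ? tag.attributes["w:val"] : null,
    children: content ? content.children : [],
  };
}

const result = readSdt(sdt);
console.log(result.tag, result.children.length);
```

A content control without a w:tag would give tag: null here, so the existing behaviour (content only) would be unchanged for those.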
Could you post a minimal example document, the expected HTML, and the actual HTML?
@mwilliamson
https://www.dropbox.com/sh/8okzro0x05lyucr/AADvQizCCvUBoDynP4lI8wRka?dl=0
The original doc is ccf.docx. The version converted using MS Word on OS X is ccf.html. The version converted via Mammoth is ccf-mammoth.html.
Specifically, note this section in ccf.html:
<w:sdt docpart="F08B8FE643320F49A0D48A470442818A" text="t" title="petitioner_name" sdttag="petitioner_name" id="1185485261"><span style="mso-spacerun:yes"> </span></w:sdt>
Hmm, I'm not sure this is really in scope for Mammoth given your expected output isn't a standard HTML element.
Thanks, understandable. I was thinking it could be stored in data attributes to make it standard HTML, like data-sdttag="petitioner_name", and this could be a way to retain any of this Word metadata when converting to HTML.
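A minimal sketch of what that output could look like, assuming the content control metadata is already available as a plain object. The escapeAttribute and sdtToSpan helpers are hypothetical names for illustration and are not part of mammoth.js.

```javascript
// Escape a value so it is safe inside a double-quoted HTML attribute.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap already-escaped inner HTML in a <span> that
// carries the content control metadata as data-* attributes.
// Assumes the metadata keys are simple names (letters, digits, hyphens).
function sdtToSpan(metadata, innerHtml) {
  const attributes = Object.keys(metadata)
    .map((key) => `data-${key.toLowerCase()}="${escapeAttribute(metadata[key])}"`)
    .join(" ");
  const prefix = attributes ? ` ${attributes}` : "";
  return `<span${prefix}>${innerHtml}</span>`;
}

const html = sdtToSpan(
  { sdttag: "petitioner_name", title: "petitioner_name", id: "1185485261" },
  ""
);
console.log(html);
// <span data-sdttag="petitioner_name" data-title="petitioner_name" data-id="1185485261"></span>
```

On the page, the values could then be read back through the standard dataset property, for example element.dataset.sdttag.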