mammoth.js

retain content control tags during conversion

Open rhuang opened this issue 3 years ago • 4 comments

If you use "Save As" with the "Web Page (.html)" format in Word, the resulting HTML retains some of the content control information, for example:

<span style="font-size: 10pt; font-family: 'Arial', sans-serif">
  <w:Sdt
    DocPart="80CD2684909242928B858862BDF7732B"
    Text="t"
    Title="full_name"
    SdtTag="full_name"
    ID="1185485261"
    >
      David Warner Roy
  </w:Sdt>
  <span style="mso-tab-count: 2"></span>
</span>

Is it possible to retain this info, particularly the SdtTag (it might be called something else in the Word XML)? It looks like Mammoth currently keeps only the content of the sdt elements and discards their properties, if I'm reading it correctly:

https://github.com/mwilliamson/mammoth.js/blob/52ec8fbfc6da3695d98e83e222232aaa3b1dcf43/lib/docx/body-reader.js#L325-L327
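
For reference, in the underlying WordprocessingML a content control's tag is stored as `w:tag` inside `w:sdtPr`, next to `w:alias` (the title) and `w:id`. Below is a rough sketch, not Mammoth code, of pulling those values out of a `word/document.xml` string. `readSdtProperties` and `readVal` are hypothetical helpers, the sample fragment is hand-written from the values above, and the regex approach assumes a plain `<w:sdtPr>` element with no attributes:

```javascript
// Hypothetical helper, not part of mammoth: collects the tag, alias (title)
// and id of each content control found in a WordprocessingML string.
function readSdtProperties(documentXml) {
  const results = [];
  // Each content control keeps its metadata in a <w:sdtPr> block.
  const sdtPrPattern = /<w:sdtPr>([\s\S]*?)<\/w:sdtPr>/g;
  let match;
  while ((match = sdtPrPattern.exec(documentXml)) !== null) {
    const properties = match[1];
    results.push({
      tag: readVal(properties, "w:tag"),
      alias: readVal(properties, "w:alias"),
      id: readVal(properties, "w:id"),
    });
  }
  return results;
}

// Returns the w:val attribute of the first <name ...> element, or null.
function readVal(xml, name) {
  const match = new RegExp("<" + name + "\\s[^>]*w:val=\"([^\"]*)\"").exec(xml);
  return match === null ? null : match[1];
}

// Hand-written fragment for illustration only.
const sample =
  '<w:sdt><w:sdtPr><w:alias w:val="full_name"/><w:tag w:val="full_name"/>' +
  '<w:id w:val="1185485261"/></w:sdtPr><w:sdtContent><w:r>' +
  '<w:t>David Warner Roy</w:t></w:r></w:sdtContent></w:sdt>';

console.log(readSdtProperties(sample));
```

A real implementation would use an XML parser rather than regular expressions; the point here is only where the tag lives in the document.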

rhuang avatar Apr 27 '22 23:04 rhuang

Could you post a minimal example document, the expected HTML, and the actual HTML?

mwilliamson avatar Sep 19 '22 09:09 mwilliamson

@mwilliamson

https://www.dropbox.com/sh/8okzro0x05lyucr/AADvQizCCvUBoDynP4lI8wRka?dl=0

The original doc is ccf.docx. The version converted with MS Word on OS X is ccf.html. The version converted with Mammoth is ccf-mammoth.html.

Specifically, note this section in ccf.html:

<w:sdt docpart="F08B8FE643320F49A0D48A470442818A" text="t" title="petitioner_name" sdttag="petitioner_name" id="1185485261"><span style="mso-spacerun:yes">&nbsp;&nbsp;</span></w:sdt>

rhuang avatar Sep 19 '22 19:09 rhuang

Hmm, I'm not sure this is really in scope for Mammoth given your expected output isn't a standard HTML element.

mwilliamson avatar Sep 19 '22 19:09 mwilliamson

Thanks, understandable. I was thinking the tag could be stored in a data attribute to keep the output standard HTML, like data-sdttag="petitioner_name". The same approach could retain any of this Word metadata when converting to HTML.
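
To make the idea concrete, here is a minimal sketch of the mapping, applied to the HTML that Word itself exports. `sdtToDataSpans` is a hypothetical post-processing function, not a Mammoth API, and it assumes the `w:sdt` attributes are double-quoted as in the snippet above:

```javascript
// Hypothetical post-processing step, not part of mammoth: rewrites <w:sdt>
// elements from Word's "Web Page" export into <span> elements whose
// attributes are carried over as data-* attributes.
function sdtToDataSpans(html) {
  return html
    // \b keeps this from matching other names that start with "w:sdt".
    .replace(/<w:sdt\b([^>]*)>/gi, (_tag, attributes) => {
      const dataAttributes = attributes.replace(
        /([A-Za-z][\w-]*)\s*=\s*"([^"]*)"/g,
        (_attr, name, value) => `data-${name.toLowerCase()}="${value}"`
      );
      return `<span${dataAttributes}>`;
    })
    .replace(/<\/w:sdt>/gi, "</span>");
}

const exported =
  '<w:sdt title="petitioner_name" sdttag="petitioner_name" id="1185485261">' +
  '<span style="mso-spacerun:yes">&nbsp;&nbsp;</span></w:sdt>';

console.log(sdtToDataSpans(exported));
```

The output is plain HTML, so the metadata can be read back with `element.dataset.sdttag` in a browser.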

rhuang avatar Sep 19 '22 19:09 rhuang