markdown-transform CiceroMark OOXML transform

The existing DOCX support is partial and handles many real-world DOCX files poorly. A first-class bidirectional transformation between CiceroMark and OOXML would be preferable.

Preferred solution

Integrate an OOXML <-> CiceroMark transform into the project.

Alternatives

We currently use a 3rd-party library to do DOCX -> Markdown transformation, which has a number of issues.

See: https://github.com/accordproject/markdown-transform/issues/144

Additional context

http://officeopenxml.com/WPdocument.php
https://docs.microsoft.com/en-us/office/dev/add-ins/word/create-better-add-ins-for-word-with-office-open-xml
https://docs.microsoft.com/en-us/openspecs/office_standards/ms-docx/b839fe1f-e1ca-4fa6-8c26-5954d0abbccd
OOXML document.xml.txt

Accord Project Schemas:

https://models.accordproject.org/markdown/[email protected]
https://models.accordproject.org/markdown/[email protected]

Mapping Table

Document -> w:document
Paragraph -> w:p
Text -> w:t
- ?? -> w:tab
- ?? -> w:noBreakHyphen
- ?? -> w:softHyphen
Linebreak -> w:br
Softbreak -> w:cr (?)
List -> w:numbering
ListItem -> w:num
Strong -> w:b
Emph -> w:i
Variable -> w:sdt (content control)
Heading -> ?? (infer from style?)
Link -> w:hyperlink
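
For illustration, the settled rows of the mapping table could be held in a simple lookup. This is only a sketch: the names CICEROMARK_TO_OOXML, OOXML_TO_CICEROMARK and ooxml_tag_for are hypothetical, and the rows still marked "??" or "(?)" are deliberately left out.

```python
# Hypothetical lookup mirroring the mapping table above.
# Rows marked "??" or "(?)" in the table are intentionally omitted.
CICEROMARK_TO_OOXML = {
    "Document": "w:document",
    "Paragraph": "w:p",
    "Text": "w:t",
    "Linebreak": "w:br",
    "List": "w:numbering",
    "ListItem": "w:num",
    "Strong": "w:b",
    "Emph": "w:i",
    "Variable": "w:sdt",
    "Link": "w:hyperlink",
}

# Reverse view, useful for the OOXML -> CiceroMark direction.
OOXML_TO_CICEROMARK = {tag: node for node, tag in CICEROMARK_TO_OOXML.items()}


def ooxml_tag_for(node_type):
    """Return the OOXML tag for a CiceroMark node type, or None if unmapped."""
    return CICEROMARK_TO_OOXML.get(node_type)
```

Returning None for unmapped types (Heading, for example) makes the open questions in the table visible to the caller instead of hiding them behind a guess.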

Jul 07 '20 10:07 dselman

@DianaLease @irmerk what is the status of this please? Is there something I can do?

Sep 21 '20 08:09 dselman

The work for supporting this transform is captured in the algoo-ooxml branch.

@algomaster99 are you able to update on this?

Sep 21 '20 14:09 jolanglinais

The branch algoo-ooxml currently contains only the OOXML -> CiceroMark transformer, and it has only been tuned for [email protected].

Currently parsed entities

It converts the following OOXML entities into CiceroMark:

There are two types of w:p: one is a heading and the other is an actual paragraph. Which one applies is decided by the w:val attribute of the w:pStyle element.

  <w:p w:rsidR="009D4C12" w:rsidRDefault="009D4C12">
    <w:pPr>
      <w:pStyle w:val="Heading2"/>
    </w:pPr>
    <w:r>
      <w:rPr>
        <w:sz w:val="40"/>
      </w:rPr>
      <w:t>Acceptance of Delivery.</w:t>
    </w:r>
  </w:p>

to

{
"$class": "org.accordproject.commonmark.Heading",
"level": "2",
"nodes": [
  {
    "$class": "org.accordproject.commonmark.Text",
    "text": "Acceptance of Delivery."
  }
]
},
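
A minimal sketch of that heading/paragraph decision, written in Python with only the standard library rather than the branch's own code. The function name paragraph_to_node and the "Heading<digit>" style-name pattern are assumptions drawn from the sample above; the xmlns:w declaration is added because a standalone fragment needs it to parse.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_to_node(p):
    """Convert a w:p element into a Heading or Paragraph node (plain dict)."""
    # Concatenate the text of every run in the paragraph.
    text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
    nodes = [{"$class": "org.accordproject.commonmark.Text", "text": text}]

    # Assumption: heading styles are named "Heading1" .. "Heading9".
    style = p.find("w:pPr/w:pStyle", NS)
    val = style.get(f"{{{W_NS}}}val", "") if style is not None else ""
    match = re.fullmatch(r"Heading(\d)", val)
    if match:
        return {
            "$class": "org.accordproject.commonmark.Heading",
            "level": match.group(1),
            "nodes": nodes,
        }
    return {"$class": "org.accordproject.commonmark.Paragraph", "nodes": nodes}


SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
  <w:r><w:t>Acceptance of Delivery.</w:t></w:r>
</w:p>
""".strip()

heading = paragraph_to_node(ET.fromstring(SAMPLE))
```

The level is kept as a string to match the JSON shown above.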

Variable

<w:sdt>
  <w:sdtPr>
    <w:rPr>
      <w:color w:val="000000"/>
      <w:sz w:val="24"/>
      <w:highlight w:val="green"/>
    </w:rPr>
    <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
    <w:tag w:val="shipper"/>
    <w:id w:val="1083948321"/>
    <w15:webExtensionLinked/>
  </w:sdtPr>
  <w:sdtContent>
    <w:r>
      <w:rPr>
        <w:color w:val="000000"/>
        <w:sz w:val="24"/>
        <w:highlight w:val="green"/>
      </w:rPr>
      <w:t>"Party A"</w:t>
    </w:r>
  </w:sdtContent>
</w:sdt>

to this

{
  "$class": "org.accordproject.ciceromark.Variable",
  "value": "\"Party A\"",
  "name": "shipper",
  "elementType": "org.accordproject.organization.Organization"
},
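
The same mapping can be sketched in a few lines of Python. This is an illustration, not the branch's implementation: sdt_to_variable is a hypothetical name, and the code assumes the alias always has the form "title | elementType" as in the sample above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def sdt_to_variable(sdt):
    """Convert a w:sdt content control into a CiceroMark Variable node (dict)."""

    def val(path):
        # Read the w:val attribute of the element at `path`, if present.
        el = sdt.find(path, NS)
        return el.get(f"{{{W_NS}}}val") if el is not None else None

    # Assumption: the alias looks like "Shipper1 | org.accordproject...Organization".
    alias = val("w:sdtPr/w:alias") or ""
    _title, _sep, element_type = alias.partition("|")

    value = "".join(t.text or "" for t in sdt.findall("w:sdtContent/w:r/w:t", NS))
    return {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": value,
        "name": val("w:sdtPr/w:tag"),
        "elementType": element_type.strip(),
    }


SAMPLE = f"""
<w:sdt xmlns:w="{W_NS}">
  <w:sdtPr>
    <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
    <w:tag w:val="shipper"/>
  </w:sdtPr>
  <w:sdtContent>
    <w:r><w:t>"Party A"</w:t></w:r>
  </w:sdtContent>
</w:sdt>
""".strip()

variable = sdt_to_variable(ET.fromstring(SAMPLE))
```

The name comes from w:tag, the element type from the part of w:alias after the "|", and the value from the text inside w:sdtContent. Run formatting such as colour and highlight is ignored here.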

Other supported entities include org.accordproject.commonmark.Text and org.accordproject.commonmark.Softbreak. Refer to the cases here to understand how the OOXML is processed.

What is the input to the parser?

This function initiates the OOXML -> CiceroMark transformation. The OOXML is very long, and we only need the content under the block that starts with <pkg:part pkg:name="/word/document.xml". This is where all the content of the document resides.
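
As a rough illustration of that narrowing step, the sketch below pulls the /word/document.xml part out of a flat OPC package using Python's standard library. find_document_part is a hypothetical helper name and the sample package is made up for the example; it is not what the branch ships.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"pkg": PKG_NS, "w": W_NS}


def find_document_part(flat_opc_xml):
    """Return the w:document element from a flat OPC package, or None."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter(f"{{{PKG_NS}}}part"):
        # Only the main document part is of interest to the transformer.
        if part.get(f"{{{PKG_NS}}}name") == "/word/document.xml":
            return part.find("pkg:xmlData/w:document", NS)
    return None


# Made-up, heavily shortened package for illustration.
SAMPLE = f"""
<pkg:package xmlns:pkg="{PKG_NS}" xmlns:w="{W_NS}">
  <pkg:part pkg:name="/word/styles.xml">
    <pkg:xmlData><w:styles/></pkg:xmlData>
  </pkg:part>
  <pkg:part pkg:name="/word/document.xml">
    <pkg:xmlData>
      <w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>
""".strip()

document = find_document_part(SAMPLE)
texts = [t.text for t in document.iter(f"{{{W_NS}}}t")]
```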

Test by running the test suite. The OOXML it processes is fetched from the document and it gets converted to a CiceroMark representation.

CiceroMark -> OOXML

This is done directly in the cicero-word-add-in repo. The source code can be found here.
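
To give a feel for the reverse direction, here is a hedged Python sketch that builds a w:sdt content control from a Variable node. It is not the add-in's code: variable_to_sdt and its title parameter are hypothetical, and formatting properties such as colour and highlight are omitted.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Serialize with the conventional "w" prefix instead of an auto-generated one.
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def variable_to_sdt(node, title):
    """Build a w:sdt element for a CiceroMark Variable node (plain dict).

    `title` is a hypothetical display name; the node itself does not carry one.
    """
    sdt = ET.Element(w("sdt"))
    props = ET.SubElement(sdt, w("sdtPr"))
    ET.SubElement(props, w("alias"), {w("val"): f"{title} | {node['elementType']}"})
    ET.SubElement(props, w("tag"), {w("val"): node["name"]})
    content = ET.SubElement(sdt, w("sdtContent"))
    run = ET.SubElement(content, w("r"))
    ET.SubElement(run, w("t")).text = node["value"]
    return sdt


node = {
    "$class": "org.accordproject.ciceromark.Variable",
    "value": '"Party A"',
    "name": "shipper",
    "elementType": "org.accordproject.organization.Organization",
}
sdt = variable_to_sdt(node, "Shipper1")
xml_text = ET.tostring(sdt, encoding="unicode")
```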

Sep 22 '20 15:09 algomaster99

@dselman @algomaster99 I have created a new issue listing the transformations that are implemented and those that remain. Let me know if there is anything to add. The issue is mentioned here.

May 27 '21 10:05 K-Kumar-01
