CiceroMark OOXML transform

Open dselman opened this issue 5 years ago • 4 comments

The existing DOCX support is partial and handles many real-world DOCX files poorly. A first-class bidirectional transformation between CiceroMark and OOXML would be preferable.

Preferred solution

Integrate an OOXML <-> CiceroMark transform into the project.

Alternatives

We currently use a third-party library for the DOCX -> Markdown transformation, which has a number of issues.

See: https://github.com/accordproject/markdown-transform/issues/144

Additional context

  • http://officeopenxml.com/WPdocument.php
  • https://docs.microsoft.com/en-us/office/dev/add-ins/word/create-better-add-ins-for-word-with-office-open-xml
  • https://docs.microsoft.com/en-us/openspecs/office_standards/ms-docx/b839fe1f-e1ca-4fa6-8c26-5954d0abbccd
  • OOXML document.xml.txt

Accord Project Schemas:

Mapping Table

  • Document -> w:document
  • Paragraph -> w:p
  • Text -> w:t
    • ?? -> w:tab
    • ?? -> w:noBreakHyphen
    • ?? -> w:softHyphen
  • Linebreak -> w:br
  • Softbreak -> w:cr (?)
  • List -> w:numbering
  • ListItem -> w:num
  • Strong -> w:b
  • Emph -> w:i
  • Variable -> w:sdt (content control)
  • Heading -> ?? (infer from style?)
  • Link -> w:hyperlink
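
As a rough illustration, the rows of the mapping table that are already settled could be held in a simple lookup table. This is only a sketch: the names `CICEROMARK_TO_OOXML`, `UNRESOLVED` and `ooxml_tag_for` are hypothetical and do not come from the project, and the rows marked with "??" or "(?)" are deliberately kept out of the settled map.

```python
# Hypothetical lookup table built from the settled rows of the mapping table.
CICEROMARK_TO_OOXML = {
    "Document": "w:document",
    "Paragraph": "w:p",
    "Text": "w:t",
    "Linebreak": "w:br",
    "List": "w:numbering",
    "ListItem": "w:num",
    "Strong": "w:b",
    "Emph": "w:i",
    "Variable": "w:sdt",
    "Link": "w:hyperlink",
}

# Rows the table still marks as open questions, kept apart so they are
# not mistaken for settled mappings.
UNRESOLVED = {
    "Softbreak": "w:cr",  # marked "(?)" in the table
    "Heading": None,      # possibly inferred from the paragraph style
}


def ooxml_tag_for(node_type):
    """Return the OOXML element for a CiceroMark node type, or None if unmapped."""
    return CICEROMARK_TO_OOXML.get(node_type)
```

For example, `ooxml_tag_for("Strong")` gives `"w:b"`, while `ooxml_tag_for("Heading")` gives `None` because that row is still open.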

dselman avatar Jul 07 '20 10:07 dselman

@DianaLease @irmerk what is the status of this please? Is there something I can do?

dselman avatar Sep 21 '20 08:09 dselman

The work for supporting this transform is captured in the algoo-ooxml branch.

@algomaster99 are you able to update on this?

jolanglinais avatar Sep 21 '20 14:09 jolanglinais

The algoo-ooxml branch currently contains only the OOXML -> CiceroMark transformer, and so far it has been refined only for [email protected].

Currently parsed entities

It transforms the following OOXML entities into CiceroMark:

  1. There are two types of w:p: one is a heading, the other an actual paragraph. The type is determined by the w:val attribute of the w:pStyle element.
      <w:p w:rsidR="009D4C12" w:rsidRDefault="009D4C12">
        <w:pPr>
          <w:pStyle w:val="Heading2"/>
        </w:pPr>
        <w:r>
          <w:rPr>
            <w:sz w:val="40"/>
          </w:rPr>
          <w:t>Acceptance of Delivery.</w:t>
        </w:r>
      </w:p>
    
    to
    {
      "$class": "org.accordproject.commonmark.Heading",
      "level": "2",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "Acceptance of Delivery."
        }
      ]
    },
    
  2. Variable
    <w:sdt>
      <w:sdtPr>
        <w:rPr>
          <w:color w:val="000000"/>
          <w:sz w:val="24"/>
          <w:highlight w:val="green"/>
        </w:rPr>
        <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
        <w:tag w:val="shipper"/>
        <w:id w:val="1083948321"/>
        <w15:webExtensionLinked/>
      </w:sdtPr>
      <w:sdtContent>
        <w:r>
          <w:rPr>
            <w:color w:val="000000"/>
            <w:sz w:val="24"/>
            <w:highlight w:val="green"/>
          </w:rPr>
          <w:t>"Party A"</w:t>
        </w:r>
      </w:sdtContent>
    </w:sdt>
    
    to this
    {
      "$class": "org.accordproject.ciceromark.Variable",
      "value": "\"Party A\"",
      "name": "shipper",
      "elementType": "org.accordproject.organization.Organization"
    },
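
The two conversions above can be sketched in Python with the standard library. This is not the code from the algoo-ooxml branch (which is JavaScript); the function names are hypothetical, the sample XML is a trimmed copy of the snippets above, and the sketch assumes that heading styles are named `HeadingN` and that the `w:alias` value has the form `<title> | <element type>`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W_VAL = "{%s}val" % W_NS  # attributes are looked up by their qualified name


def run_text(element):
    """Join the text of every w:t descendant of the element."""
    return "".join(t.text or "" for t in element.iter("{%s}t" % W_NS))


def paragraph_to_node(p):
    """Map w:p to a Heading when w:pStyle is HeadingN, otherwise to a Paragraph."""
    style = p.find("w:pPr/w:pStyle", NS)
    style_name = style.get(W_VAL, "") if style is not None else ""
    text = {"$class": "org.accordproject.commonmark.Text", "text": run_text(p)}
    level = style_name[len("Heading"):]
    if style_name.startswith("Heading") and level.isdigit():
        return {
            "$class": "org.accordproject.commonmark.Heading",
            "level": level,  # kept as a string, as in the output above
            "nodes": [text],
        }
    return {"$class": "org.accordproject.commonmark.Paragraph", "nodes": [text]}


def sdt_to_variable(sdt):
    """Map a w:sdt content control to a Variable node."""
    alias = sdt.find("w:sdtPr/w:alias", NS).get(W_VAL)
    name = sdt.find("w:sdtPr/w:tag", NS).get(W_VAL)
    # Assumption: the element type is the part of the alias after the "|".
    element_type = alias.split("|")[-1].strip()
    return {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": run_text(sdt.find("w:sdtContent", NS)),
        "name": name,
        "elementType": element_type,
    }


# Trimmed versions of the two OOXML snippets shown above.
HEADING_XML = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
    "<w:r><w:t>Acceptance of Delivery.</w:t></w:r></w:p>"
)
VARIABLE_XML = (
    f'<w:sdt xmlns:w="{W_NS}"><w:sdtPr>'
    '<w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>'
    '<w:tag w:val="shipper"/></w:sdtPr>'
    '<w:sdtContent><w:r><w:t>"Party A"</w:t></w:r></w:sdtContent></w:sdt>'
)

heading_node = paragraph_to_node(ET.fromstring(HEADING_XML))
variable_node = sdt_to_variable(ET.fromstring(VARIABLE_XML))
```

Formatting properties such as `w:sz`, `w:color` and `w:highlight` are ignored here, since neither CiceroMark output above carries them.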
    

Other supported entities include org.accordproject.commonmark.Text and org.accordproject.commonmark.Softbreak. Refer to the cases here to understand how the transformer processes the OOXML.

What is the input to the parser?

This function initiates the OOXML -> CiceroMark transformation. The full OOXML is very long, and we only need the content under the pkg:part block whose pkg:name is "/word/document.xml", which is where all the content of the document resides.
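
To make the input selection concrete, here is a minimal Python sketch of pulling the `/word/document.xml` part out of a flat OPC package. It is not the function referred to above: `find_document_part` is a hypothetical name, the sample package is heavily shortened, and the sketch assumes the part body sits inside a `pkg:xmlData` wrapper as in the standard flat OPC layout.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_document_part(flat_opc_xml):
    """Return the w:document element of the /word/document.xml part, or None."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter("{%s}part" % PKG_NS):
        if part.get("{%s}name" % PKG_NS) == "/word/document.xml":
            # The part body is wrapped in pkg:xmlData.
            return part.find("{%s}xmlData/{%s}document" % (PKG_NS, W_NS))
    return None


# Shortened, illustrative package; a real one has many more parts.
SAMPLE_PACKAGE = """<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
  <pkg:part pkg:name="/_rels/.rels"><pkg:xmlData/></pkg:part>
  <pkg:part pkg:name="/word/document.xml">
    <pkg:xmlData>
      <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
        <w:body><w:p><w:r><w:t>Acceptance of Delivery.</w:t></w:r></w:p></w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>"""

document = find_document_part(SAMPLE_PACKAGE)
paragraphs = document.findall("{%s}body/{%s}p" % (W_NS, W_NS))
```

Every other part of the package (relationships, styles, settings and so on) is skipped, which matches the point that only the document part is needed.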

To test it, run the test suite. The OOXML is fetched from the document and converted to a CiceroMark representation.

CiceroMark -> OOXML

This is done directly in the cicero-word-add-in repo. The source code can be found here.

algomaster99 avatar Sep 22 '20 15:09 algomaster99

@dselman @algomaster99 I have created a new issue listing the transformations that are implemented and those that remain. Let me know if there is anything to add. The issue is mentioned here.

K-Kumar-01 avatar May 27 '21 10:05 K-Kumar-01