markdown-transform
                                
                            
                            
                            
                        CiceroMark OOXML transform
The existing DOCX support is partial and produces poor-quality results for many real-world DOCX files. A first-class bidirectional transformation between CiceroMark and OOXML would be preferable.
Preferred solution
Integrate an OOXML <-> CiceroMark transform into the project.
Alternatives
We currently use a 3rd-party library to do DOCX -> Markdown transformation, which has a number of issues.
See: https://github.com/accordproject/markdown-transform/issues/144
Additional context
- http://officeopenxml.com/WPdocument.php
- https://docs.microsoft.com/en-us/office/dev/add-ins/word/create-better-add-ins-for-word-with-office-open-xml
- https://docs.microsoft.com/en-us/openspecs/office_standards/ms-docx/b839fe1f-e1ca-4fa6-8c26-5954d0abbccd
- OOXML document.xml.txt
Accord Project Schemas:
- https://models.accordproject.org/markdown/[email protected]
- https://models.accordproject.org/markdown/[email protected]
Mapping Table
- Document -> w:document
- Paragraph -> w:p
- Text -> w:t
- ?? -> w:tab
- ?? -> w:noBreakHyphen
- ?? -> w:softHyphen
- Linebreak -> w:br
- Softbreak -> w:cr (?)
- List -> w:numbering
- ListItem -> w:num
- Strong -> w:b
- Emph -> w:i
- Variable -> w:sdt (content control)
- Heading -> ?? (infer from style?)
- Link -> w:hyperlink
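As a rough illustration, the settled rows of the mapping table could be held in a simple lookup. This is only a sketch: the names `CICEROMARK_TO_OOXML`, `OOXML_TO_CICEROMARK` and `ooxml_tag_for` are hypothetical and not part of markdown-transform, and the rows marked `??` or `(?)` above are deliberately left out.

```python
from typing import Optional

# Hypothetical lookup built from the mapping table above.
# Rows that the table marks as undecided ("??" or "(?)") are omitted on purpose.
CICEROMARK_TO_OOXML = {
    "Document": "w:document",
    "Paragraph": "w:p",
    "Text": "w:t",
    "Linebreak": "w:br",
    "List": "w:numbering",
    "ListItem": "w:num",
    "Strong": "w:b",
    "Emph": "w:i",
    "Variable": "w:sdt",
    "Link": "w:hyperlink",
}

# Reverse direction, valid here because every OOXML tag appears only once.
OOXML_TO_CICEROMARK = {tag: node for node, tag in CICEROMARK_TO_OOXML.items()}


def ooxml_tag_for(node_type: str) -> Optional[str]:
    """Return the proposed OOXML tag for a CiceroMark node type, or None if undecided."""
    return CICEROMARK_TO_OOXML.get(node_type)
```

A node type with no settled mapping, such as Heading, returns `None`, which keeps the open questions in the table visible to callers.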
@DianaLease @irmerk what is the status of this please? Is there something I can do?
The work for supporting this transform is captured in the algoo-ooxml branch.
@algomaster99 are you able to update on this?
The algoo-ooxml branch currently contains only the OOXML -> CiceroMark transformer, and it has only been tuned for [email protected].
Currently parsed entities
It transfers the following OOXML entities into CiceroMark:
- There are two types of w:p: one is a heading, the other is an ordinary paragraph. Which one applies is decided by the w:pStyle element (its w:val attribute). For example, this OOXML:

      <w:p w:rsidR="009D4C12" w:rsidRDefault="009D4C12">
        <w:pPr>
          <w:pStyle w:val="Heading2"/>
        </w:pPr>
        <w:r>
          <w:rPr>
            <w:sz w:val="40"/>
          </w:rPr>
          <w:t>Acceptance of Delivery.</w:t>
        </w:r>
      </w:p>

  is transformed to:

      {
        "$class": "org.accordproject.commonmark.Heading",
        "level": "2",
        "nodes": [
          {
            "$class": "org.accordproject.commonmark.Text",
            "text": "Acceptance of Delivery."
          }
        ]
      },
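The heading-versus-paragraph decision can be sketched in a few lines of Python using only the standard library. This is not the code from the algoo-ooxml branch; `paragraph_to_node` is a hypothetical helper, and it assumes heading styles are named `Heading1` to `Heading9` and that each paragraph reduces to a single Text node.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def paragraph_to_node(p):
    """Map a w:p element to a Heading or Paragraph node (illustrative only)."""
    # Concatenate the text of every w:t run inside the paragraph.
    text = "".join(t.text or "" for t in p.iter(W + "t"))
    nodes = [{"$class": "org.accordproject.commonmark.Text", "text": text}]

    # Assumption: a style value such as "Heading2" marks a level-2 heading.
    style = p.find(W + "pPr/" + W + "pStyle")
    style_name = style.get(W + "val", "") if style is not None else ""
    match = re.fullmatch(r"Heading(\d)", style_name)
    if match:
        return {
            "$class": "org.accordproject.commonmark.Heading",
            "level": match.group(1),
            "nodes": nodes,
        }
    return {"$class": "org.accordproject.commonmark.Paragraph", "nodes": nodes}


HEADING_XML = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
    "<w:r><w:t>Acceptance of Delivery.</w:t></w:r></w:p>"
)
PLAIN_XML = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>Plain text.</w:t></w:r></w:p>'

heading_node = paragraph_to_node(ET.fromstring(HEADING_XML))
plain_node = paragraph_to_node(ET.fromstring(PLAIN_XML))
```

The level is kept as a string to match the JSON shown in the example; a paragraph without a matching w:pStyle falls through to a Paragraph node.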
- Variable. This OOXML:

      <w:sdt>
        <w:sdtPr>
          <w:rPr>
            <w:color w:val="000000"/>
            <w:sz w:val="24"/>
            <w:highlight w:val="green"/>
          </w:rPr>
          <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
          <w:tag w:val="shipper"/>
          <w:id w:val="1083948321"/>
          <w15:webExtensionLinked/>
        </w:sdtPr>
        <w:sdtContent>
          <w:r>
            <w:rPr>
              <w:color w:val="000000"/>
              <w:sz w:val="24"/>
              <w:highlight w:val="green"/>
            </w:rPr>
            <w:t>"Party A"</w:t>
          </w:r>
        </w:sdtContent>
      </w:sdt>

  is transformed to this:

      {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": "\"Party A\"",
        "name": "shipper",
        "elementType": "org.accordproject.organization.Organization"
      },
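The variable mapping above reads three things from the content control: the w:tag value becomes `name`, the part of w:alias after the `|` becomes `elementType`, and the run text becomes `value`. A minimal Python sketch of that reading follows; `sdt_to_variable` is a hypothetical helper rather than the branch's implementation, and it assumes the alias always has the form `title | elementType`.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def _val(parent, local_name):
    """Return the w:val attribute of a child element, or '' if the child is absent."""
    child = parent.find(W + local_name) if parent is not None else None
    return child.get(W + "val", "") if child is not None else ""


def sdt_to_variable(sdt):
    """Map a w:sdt content control to a CiceroMark Variable node (illustrative only)."""
    props = sdt.find(W + "sdtPr")
    # Assumption: the alias is "<title> | <elementType>", as in the example above.
    _title, _sep, element_type = _val(props, "alias").partition("|")
    content = sdt.find(W + "sdtContent")
    runs = content.iter(W + "t") if content is not None else []
    return {
        "$class": "org.accordproject.ciceromark.Variable",
        "value": "".join(t.text or "" for t in runs),
        "name": _val(props, "tag"),
        "elementType": element_type.strip(),
    }


# Trimmed version of the example above (run properties and w15 element removed).
SDT_XML = f'''<w:sdt xmlns:w="{W_NS}">
  <w:sdtPr>
    <w:alias w:val="Shipper1 | org.accordproject.organization.Organization"/>
    <w:tag w:val="shipper"/>
  </w:sdtPr>
  <w:sdtContent><w:r><w:t>"Party A"</w:t></w:r></w:sdtContent>
</w:sdt>'''

variable_node = sdt_to_variable(ET.fromstring(SDT_XML))
```

Formatting properties such as colour and highlight are ignored here, since the Variable node in the example carries no styling.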
More entities include the org.accordproject.commonmark.Text and org.accordproject.commonmark.Softbreak. Refer to the cases here to understand how it processes the OOXML.
What is the input to the parser?
This function initiates the OOXML -> CiceroMark transformation. The full OOXML is very long, and only the content under the block that begins with <pkg:part pkg:name="/word/document.xml" is needed, since this is where all the content of the document resides.
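To make the input step concrete, here is a small Python sketch that pulls the document part out of a flat OPC package. It assumes the usual flat OPC layout (pkg:package containing pkg:part elements, each wrapping its XML in pkg:xmlData); `find_document_part` is a hypothetical name, not the function referred to above.

```python
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_document_part(flat_opc_xml):
    """Return the w:document element stored under /word/document.xml, or None."""
    root = ET.fromstring(flat_opc_xml)
    for part in root.iter("{%s}part" % PKG_NS):
        if part.get("{%s}name" % PKG_NS) == "/word/document.xml":
            # Assumption: the part's XML payload sits inside pkg:xmlData.
            return part.find("{%s}xmlData/{%s}document" % (PKG_NS, W_NS))
    return None


# Tiny hand-written package used only to exercise the function.
PACKAGE_XML = f'''<pkg:package xmlns:pkg="{PKG_NS}">
  <pkg:part pkg:name="/_rels/.rels"><pkg:xmlData/></pkg:part>
  <pkg:part pkg:name="/word/document.xml">
    <pkg:xmlData>
      <w:document xmlns:w="{W_NS}">
        <w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
      </w:document>
    </pkg:xmlData>
  </pkg:part>
</pkg:package>'''

document = find_document_part(PACKAGE_XML)
texts = [t.text for t in document.iter("{%s}t" % W_NS)]
```

Everything outside this part (relationships, styles, settings) is skipped, which matches the note above that only the document part carries the content.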
Test by running the test suite. The OOXML it processes is fetched from the document and it gets converted to a CiceroMark representation.
CiceroMark -> OOXML
This is done directly in the cicero-word-add-in repo. The source code can be found here.
@dselman @algomaster99 I have created a new issue listing the transformations that are implemented and those that remain. Let me know if there is anything to add. The issue is mentioned here.