pandoc
[docx->markdown] code blocks are not detected
I use pandoc to convert docx documents to markdown. These documents contain code blocks, and I apply a filter to transform them properly. However, it seems that in newer pandoc versions, code blocks are no longer detected by the docx reader.
Sample document: document.docx
Command: pandoc document.docx -f docx -t markdown_strict -s -o document.md
Result with pandoc 2.7.2:
Text:
> SELECT
> account,
> YEAR(date) AS year,
> SUM(COST(position)) AS balance
> WHERE
> currency = 'USD'
> ORDER BY 1,2;
Text text text.
Result with pandoc 2.8.1:
Text:
> SELECT
> account,
> YEAR(date) AS year,
> SUM(COST(position)) AS balance
> WHERE
> currency = 'USD'
> ORDER BY 1,2;
Text text text.
It seems that perhaps in the past we parsed paragraphs with style SourceCode as code blocks. But this stopped working. There's a comment in the docx reader that suggests that it should work:
- [X] CodeBlock (styled with `SourceCode`)
So something broke this. Need to look into it.
OK, it does actually work -- see test/docx/codeblocks.docx
The reason this test file works and the above document.docx does not is that the word/styles.xml components of the two docx containers differ.
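To compare the two files, it can help to list the paragraph styles each one declares. The following is only a sketch using the Python standard library; the function names `paragraph_styles` and `docx_paragraph_styles` are made up for illustration and are not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, normally bound to the "w:" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_styles(styles_xml):
    """Return (styleId, name) pairs for each paragraph style in a styles.xml payload."""
    root = ET.fromstring(styles_xml)
    pairs = []
    for style in root.iter(f"{{{W}}}style"):
        if style.get(f"{{{W}}}type") != "paragraph":
            continue  # skip character, table and numbering styles
        name = style.find(f"{{{W}}}name")
        pairs.append((
            style.get(f"{{{W}}}styleId"),
            name.get(f"{{{W}}}val") if name is not None else None,
        ))
    return pairs


def docx_paragraph_styles(path):
    """Read word/styles.xml out of a .docx container (hypothetical helper)."""
    with zipfile.ZipFile(path) as z:
        return paragraph_styles(z.read("word/styles.xml"))
```

Running `docx_paragraph_styles("document.docx")` and the same call on `test/docx/codeblocks.docx` shows the id and the display name side by side, which is where the two files diverge.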
In codeblocks.docx (working) we have
<w:style w:type="paragraph" w:customStyle="1" w:styleId="SourceCode">
<w:name w:val="Source Code" />
<w:basedOn w:val="Normal" />
<w:link w:val="VerbatimChar" />
<w:pPr>
<w:wordWrap w:val="off" />
</w:pPr>
</w:style>
while in document.docx we have
<w:style w:type="paragraph" w:customStyle="1" w:styleId="SourceCode"><w:name w:val="SourceCode"/></w:style>
If we change the latter so that we have
<w:name w:val="Source Code"/>
(note the space) then it works.
Pandoc is looking for a style with the name Source Code, not a style with the id (or name) SourceCode.
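For documents that cannot easily be re-saved from Word, the rename can also be applied to the container directly. This is a rough sketch, not something pandoc provides: `fix_styles_xml` and `fix_docx` are hypothetical names, and the regular expression assumes the usual `w:` namespace prefix and a style whose name is exactly `SourceCode`.

```python
import re
import zipfile

# Assumption: the style name is written as <w:name w:val="SourceCode"/>
# with the conventional "w:" prefix.
_NAME = re.compile(rb'(<w:name\s+w:val=")SourceCode(")')


def fix_styles_xml(data: bytes) -> bytes:
    """Rename the style 'SourceCode' to 'Source Code' (note the space)."""
    return _NAME.sub(rb'\1Source Code\2', data)


def fix_docx(src, dst):
    """Copy a .docx from src to dst, patching only word/styles.xml."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/styles.xml":
                data = fix_styles_xml(data)
            # Reuse the original ZipInfo so names and order are preserved.
            zout.writestr(item, data)
```

For example, `fix_docx("document.docx", "document-fixed.docx")` writes a patched copy that can then be passed to the same pandoc command as before. The style id is left untouched; only the display name changes.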
Not sure if this is really a bug, since pandoc does have a way of recognizing source code. But we could perhaps also react to SourceCode as the style name.
It would be nice if all of the following style names were recognized: SourceCode, source_code, Source Code.
This should also be documented somewhere.
Thanks for looking into it. I can confirm that the workaround works.
I am sorry to ask this noob question, but how can pandoc recognize a code block in a docx document? Do I have to create a style in Word called "Source Code"?
I am trying to convert a docx document with code blocks to markdown, but the results always have backslashes (\) before the backtick (`) characters.
Thank you in advance
Yes, correct.
Thanks @masters3d for the information. I have tried many different ways, but none has worked. Is there an example file with the correct style that I can download?
In addition, is there a way to directly indicate that the code is in a certain language (e.g., using {.python})?