Plain handled more like Para than as plain text in some output formats
In order to set custom styles in ODT/OpenDocument format, I tried to achieve the following output with a Lua filter:
<text:p text:style-name="CustomStyle">Some words.</text:p>
I built a Plain object containing the desired tag as RawInlines like this:
function Para(para)
  -- wrap the paragraph's inlines in raw OpenDocument paragraph tags
  local inlines = para.content
  inlines:insert(1, pandoc.RawInline('opendocument',
    '<text:p text:style-name="CustomStyle">'))
  inlines:insert(pandoc.RawInline('opendocument', '</text:p>'))
  return pandoc.Plain(inlines)
end
But I end up with the following result:
pandoc -t opendocument -L test.lua <<< 'Some words.'
<text:p text:style-name="Text_20_body"><text:p text:style-name="CustomStyle">Some
words.</text:p></text:p>
So, although Plain is defined as "plain text, not a paragraph", the OpenDocument writer handles it like a Para object, which prevents wrapping its content inside arbitrary tags. The same applies to DOCX: after adapting the RawInlines in test.lua, we get:
<w:p>
  <w:pPr>
    <w:pStyle w:val="FirstParagraph"/>
  </w:pPr>
  <w:p>
    <w:pPr>
      <w:pStyle w:val="CustomStyle"/>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">Some words.</w:t>
    </w:r>
  </w:p>
</w:p>
Is this on purpose? I acknowledge that the advantage of this behavior is that it prevents invalid output. But since, to my knowledge, Plain objects can only be created in filters, one could assume that users know why they use them instead of Para, and not outputting them as fully formatted paragraphs would allow for much greater flexibility.
Also, it seems to be inconsistent across formats. The following:
[Plain [RawInline (Format "tex") "{\\bf ",Str "Some",Space,Str "words.",RawInline (Format "tex") "}"]
,Plain [RawInline (Format "tex") "{\\bf ",Str "Others.",RawInline (Format "tex") "}"]]
renders in LaTeX output:
{\bf Some words.}
{\bf Others.}
and in ConTeXt:
{\bf Some words.}
{\bf Others.}
Well, Plain is a block-level element. So some kind of block-level tag is needed in openxml, or you just have invalid openxml.
Here two design goals collide: we want Plain to be something short of a paragraph, but we also want a Plain to render validly.
> Here two design goals collide: we want Plain to be something short of a paragraph, but we also want a Plain to render validly.
Well, there are two sorts of users:
1. the regular users, those who type out their documents and then use Pandoc for export.
2. the library writers, for example the Emacs Org mode exporter.
People in set (1) want a well-formed document with all the bells and whistles, like styles.xml, meta.xml, etc.
People in set (2), the library writers, are interested in using Pandoc as a citation processor. These are precisely the people whom the very recent citation executable (culled out from pandoc-citeproc) targets. People in set (2) are interested in the citation aspects of Pandoc, and use it as a CLI tool or a library to generate text fragments (i.e., inlines as opposed to a paragraph).
That is, people in set (2) are interested in well-formed inlines only and are NOT interested in well-formed documents.
> Well there are two sort of users:
> - the regular users, those who type out their documents and then use Pandoc for export.
> - there are library writers, for example Emacs Orgmode exporter
Perhaps surprisingly, my request was aimed more at the type 1 users, of whom I am one; more precisely, at those who use filters to extend Pandoc's exporting capabilities when they export their documents. They want to end up with a well-formed document, but may need to build arbitrary block elements with raw code. That is what I wrongly expected Plain elements to be for: something like a mere list of inlines that one could wrap in whatever code one wants in the target format (like <w:p> tags in OOXML with custom properties).
> So, although Plain is defined as "plain text, not a paragraph", it is handled like a Para object in Opendocument writer, which prevents from wrapping its content inside arbitrary tags
> Also, it seems to be inconsistent across formats.
Why not introduce a new value constructor Plain' (Plain Prime) which achieves the desired result? This new value constructor could be secret and undocumented.
Another solution to this general problem would be to provide functions like writeInlineOpenXML or writeInlineOpenDocument with signature
:: PandocMonad m => WriterOptions -> [Inline] -> Text
On Tuesday 18 May 2021 at 08:43:55AM, John MacFarlane wrote:
> Another solution to this general problem would be to provide functions like writeInlineOpenXML or writeInlineOpenDocument with signature
> :: PandocMonad m => WriterOptions -> [Inline] -> Text
So that one could do something like this?
pandoc.RawBlock('openxml', '<w:p>' .. writeInlineOpenXML(inlines) .. '</w:p>')
If I understand correctly, that would be great indeed!
Yes that's the idea.
We would only need these functions in a few special cases where the current rendering of Plain has to include p tags (opendocument, openxml, others?).
Theoretically in all XML/SGML formats, I guess.
The more I think about it, the more I see how much power such writeInline<Something> functions would give. They would make it possible to manipulate the resulting string, for instance to change or add attributes. This would really help to extend Pandoc's capabilities with XML-based formats.
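To illustrate the kind of post-processing meant here, the sketch below rewrites the style attribute on an already rendered OpenDocument paragraph. It is plain Lua string handling and uses no pandoc API; the name `set_paragraph_style` is made up for this example, and how the rendered string would be obtained (for instance from a writeInline<Something> function, which does not exist at this point in the discussion) is left open.

```lua
-- Hypothetical helper (not part of pandoc): replace, or add, the
-- text:style-name attribute on the outermost <text:p> start tag.
-- `style` is assumed to contain no characters that need XML escaping.
function set_paragraph_style(xml, style)
  local attr = 'text:style-name="' .. style .. '"'
  -- Split the input into the attributes of the leading <text:p ...> tag
  -- and everything after that tag. The frontier pattern makes sure the
  -- tag name is exactly "text:p".
  local attrs, rest = xml:match('^%s*<text:p(%f[%s>][^>]*)>(.*)$')
  if not attrs then
    return xml -- not a paragraph: leave the input untouched
  end
  if attrs:find('text:style%-name="[^"]*"') then
    -- A style attribute is already present: swap it for the new one.
    attrs = attrs:gsub('text:style%-name="[^"]*"',
                       function() return attr end, 1)
  else
    attrs = attrs .. ' ' .. attr
  end
  return '<text:p' .. attrs .. '>' .. rest
end

print(set_paragraph_style(
  '<text:p text:style-name="Text_20_body">Some words.</text:p>',
  'CustomStyle'))
-- prints: <text:p text:style-name="CustomStyle">Some words.</text:p>
```

Pattern matching on XML is fragile in general; it is used here only because the input is a single paragraph produced by one writer.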
The current DocBook writer shows another way we might go: it does render Plain as just a sequence of inlines.
% pandoc -f native -t docbook
Plain [Str "hi"]
hi
It avoids generating invalid XML by using a plainToPara function to convert Plain to Para in lists and Divs (which are the contexts in which Plain usually appears in content parsed by the markdown reader). This means, though, that invalid XML could be produced from manually constructed Pandoc structures, so it's not absolutely reliable.
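For readers more at home in Lua than in the Haskell writers, the effect of such a transformation can be approximated in a filter. This is only a sketch and not the DocBook writer's code: it covers Divs and bullet/ordered lists only, and the names `plain_to_para`, `convert_blocks` and `convert_items` are chosen here for illustration. Modified lists are assigned back to the element rather than changed in place, which should work regardless of how a given pandoc version hands out element fields.

```lua
-- Turn a Plain block into a Para; leave every other block unchanged.
function plain_to_para(block)
  if block.t == 'Plain' then
    return pandoc.Para(block.content)
  end
  return block
end

-- Apply plain_to_para to each block in a list of blocks.
local function convert_blocks(blocks)
  return pandoc.List(blocks):map(plain_to_para)
end

-- Apply it to the blocks of every item of a bullet or ordered list.
local function convert_items(list)
  list.content = pandoc.List(list.content):map(convert_blocks)
  return list
end

function Div(div)
  div.content = convert_blocks(div.content)
  return div
end

BulletList = convert_items
OrderedList = convert_items
```

Note that this turns tight lists into loose ones in formats that distinguish the two, so it is a demonstration of the idea rather than something to apply blindly.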
One possibility would be to change all the XML-based writers so they work this way:
- If standalone is false and the content passed in consists of one Plain block, just render it as inlines
- Otherwise, do as before (rendering Plain as a Para)
This would avoid the need to export a new function, though the behavior is a bit complex.
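The proposed rule can be spelled out as a small decision function. The sketch below is a toy model in plain Lua, not pandoc's writer code: blocks are bare tables with a `t` tag and an already rendered `inlines` string, and `<para>` is a placeholder for whatever paragraph element the target format uses. All names are hypothetical.

```lua
-- Toy model (hypothetical; not pandoc internals).
-- A block is { t = 'Plain' | 'Para', inlines = '<rendered inline content>' }.
local function render_block(block)
  -- Default behaviour: Plain and Para both get paragraph tags.
  return '<para>' .. block.inlines .. '</para>'
end

function render_document(blocks, standalone)
  -- Special case: a fragment consisting of exactly one Plain block
  -- is rendered as bare inlines.
  if not standalone and #blocks == 1 and blocks[1].t == 'Plain' then
    return blocks[1].inlines
  end
  -- Otherwise, do as before.
  local out = {}
  for i, block in ipairs(blocks) do
    out[i] = render_block(block)
  end
  return table.concat(out, '\n')
end

print(render_document({ { t = 'Plain', inlines = 'hi' } }, false))
-- prints: hi
print(render_document({ { t = 'Plain', inlines = 'hi' } }, true))
-- prints: <para>hi</para>
```

In this model, a Plain that sits next to any other block still gets paragraph tags, since only the single-block fragment takes the special path.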
It seems to me that this wouldn't allow building arbitrary blocks (i.e. XML elements) around the Plain's content. Or maybe you could replace "if standalone is false" by "if there are no surrounding blocks, or two surrounding RawBlocks, or standalone is false"?
I also find the "writeInline<Something>" approach satisfying on a conceptual level, since we would have inlines inside a single RawBlock element, and not three RawBlocks building one XML element. But I fully understand that implementing every great idea one can think of would turn the developers' work, and Pandoc itself, into a nightmare...
If you're using this function in a program, then I don't really see the difference between
inlinedoc <- writeInlineOpenDocument opts inlines
and
inlinedoc <- writeOpenDocument opts (Pandoc nullMeta [Plain inlines])
They do the same work, no? And if you're not using this in a program, then how exactly would you be taking advantage of writeInlineOpenDocument?
So, from a filter, it would be possible to make a system call (for instance through pandoc.pipe) in order to pass this code to GHC using Pandoc's API? If so, I don't have any objection.
Right now filters have access to read but not write. It looks like what you want to do is to insert a Plain element into the AST and have it render as plain inlines; that's not something the writeInline* functions would allow you to do.
I'd like to think more about the writers that currently render Plain with paragraph tags, and see if we can't come up with an alternative approach that will be compatible with the sort of thing you're trying to do.
Thank you!
Some notes on the current treatment of Plain in XML-based writers:
- Docx: rendered as a Para, except that in a table cell or list item, Plain is rendered with Compact style.
- OpenDocument: rendered as a Para
- DocBook: rendered as plain inline content; however, in list items and Div bodies, a plainToPara transformation is done first to eliminate the Plain elements
- JATS: rendered as Para, but there's still a vestigial plainToPara transformation in deflistItemToJATS -- as far as I can see, it could be eliminated.
- ICML: rendered with para tags but without the Paragraph style
- TEI: rendered as a Para, but again there's a vestigial use of plainToPara which seems redundant
- Powerpoint: rendered as a Para
So the question is whether we could move to a model like DocBook's for the others. We'd have to be very sure that we do the plainToPara transformation in every context where a Plain might be generated by our readers.
The test suite shows Plain occurring in
- lists
- inside Divs
- interspersed with RawBlocks for raw HTML (e.g. command test 5360)
- table cells
Also it appears as the result of parsing HTML without surrounding <p> tags (e.g. command test 4877). This could appear just about anywhere, e.g.
% pandoc -f html -t native
<blockquote>hi</blockquote>
[BlockQuote
[Plain [Str "hi"]]]
Test case 3510 involves org:
% pandoc -f org -t native
Text
#+include: "command/3510-subdoc.org"
#+INCLUDE: "command/3510-src.hs" src haskell
#+INCLUDE: "command/3510-export.latex" export latex
More text
^D
[Para [Str "Text"]
,Header 1 ("subsection",[],[]) [Str "Subsection"]
,Para [Str "Included",Space,Str "text"]
,Plain [Str "Lorem",Space,Str "ipsum."]
,CodeBlock ("",["haskell"],[]) "putStrLn outString\n"
,RawBlock (Format "latex") "\\emph{Hello}"
,Para [Str "More",Space,Str "text"]]
cat test/command/yaml-with-chomp.md
% pandoc -s -t native
---
ml: |-
TEST
BLOCK
...
^D
Pandoc (Meta {unMeta = fromList [("ml",MetaBlocks [Para [Str "TEST"],Plain [Str "BLOCK"]])]})
[]
I'm not sure I see a good way to separate the Plains that will need to have paragraph tags added to produce valid HTML and the ones that won't.
@jgm so the main difference between Para and Plain is that Plain avoids added whitespace in lists and tables? Is that true of TeX output formats as well? In particular, will there be whitespace between a RawBlock and a Plain in LaTeX output?
@bpj, you can test it yourself:
% pandoc -t latex -f native
[Plain [Str "hi"], Plain [Str "hi"]]
hi
hi
The LaTeX writer is a bit different from others; it always inserts blank lines between block-level elements. This is sometimes undesirable, I know (#7111).
Faced the same issue writing a Lua filter for JATS output. The new pandoc.write function helps a lot, but doesn't cover all use cases.
I want to convert some native Divs into JATS statement elements:
<statement>
<label> inlines </label>
<title> inlines </title>
</statement>
Note that the label and title elements can't be wrapped within <p> tags and cannot contain <p> tags. The following Lua code:
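function Pandoc(doc)
  local inlines = pandoc.List:new{pandoc.Str('Some label text')}
  inlines:insert(1, pandoc.RawInline('jats', '<label>'))
  inlines:insert(pandoc.RawInline('jats', '</label>'))
  -- append the Plain block and write the modified list back
  local blocks = doc.blocks
  blocks:insert(pandoc.Plain(inlines))
  doc.blocks = blocks
  return doc
end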
Generates:
<p><label>Some label text</label></p>
Label or title may contain special elements, e.g. citations, so they shouldn't be simply stringified and inserted as a RawBlock. A better approach is to use pandoc.write. We need to pass to pandoc.write at least the original cite method and the document's metadata (for bibliography info and perhaps other settings). But we shouldn't pass all of PANDOC_WRITER_OPTIONS because (in Pandoc v2.17 and 2.18 at least) this will generate a full standalone output if Pandoc was called in standalone mode.
-- assuming doc.meta contains the document metadata
-- and label_inlines contains the label's inlines
function write_to_jats(inlines)
  -- pass only the cite method, not all of PANDOC_WRITER_OPTIONS,
  -- so that no standalone document is generated
  local options = pandoc.WriterOptions({
    cite_method = PANDOC_WRITER_OPTIONS.cite_method
  })
  local mini_doc = pandoc.Pandoc({pandoc.Plain(inlines)}, doc.meta)
  local result = pandoc.write(mini_doc, 'jats', options)
  return result:match('^<p>(.*)</p>$') or result or '' -- safely remove <p> tags
end
doc.blocks:insert(pandoc.RawBlock('jats', '<label>'..write_to_jats(label_inlines)..'</label>'))
There's still a limitation, however: if the inlines need to be processed by another filter down the line (in my use case, pandoc-crossref), they're lost.
I see the problem -- but JATS doesn't have a block-level container corresponding to Plain, so we have to treat it like Para or we'll get invalid JATS in other contexts.