pandoc
ConTeXt writer: tag paragraphs
Paragraphs are enclosed by \startparagraph and \stopparagraph commands. This ensures better tagging results in PDF output.
Demonstrating the difference (extracted from the PDF with pdfinfo -struct-text):
before
Div
"Term Paper TitleBird2022-01-23"
Sect "section"
Div
H (block)
"Knuth"
Div
"Thus, I came to the conclusion that the designer of a new system must not onlybe the implementer and first large–scale user; the designer should also write thefirst user manual.The separation of any of these four components would have hurt TeX significantly. IfI had not participated fully in all these activities, literally hundreds of improvementswould never have been made, because I would never have thought of them orperceived why they were important.But a system cannot be successful if it is too strongly influenced by a single person.Once the initial design is complete and fairly robust, the real test begins as peoplewith many different viewpoints undertake their own experiments."
after
"Term Paper TitleBird2022-01-23"
Sect "section"
Div
H (block)
"Knuth"
Div
P (block)
"Thus, I came to the conclusion that the designer of a new system must not onlybe the implementer and first large–scale user; the designer should also write thefirst user manual."
P (block)
"The separation of any of these four components would have hurt TeX significantly. IfI had not participated fully in all these activities, literally hundreds of improvementswould never have been made, because I would never have thought of them orperceived why they were important."
P (block)
"But a system cannot be successful if it is too strongly influenced by a single person.Once the initial design is complete and fairly robust, the real test begins as peoplewith many different viewpoints undertake their own experiments."
I'm just not sure if the additional verbosity is worth it.
CC: @denismaier @klpn
Sounds like a win to me, but let's see what the ConTeXt experts say.
In general that's a useful addition, especially when going directly to PDF. Maybe, if you convert to ConTeXt sources, some might prefer the less verbose alternative. Maybe a new command line option could be useful?
I suppose we could add an extension like tags or paragraph_tags. But I'm not sure how important this would be. If someone is using pandoc to generate ConTeXt that will then be hand-edited, and they don't want these things, they could always pipe the output through
sed -E -e '/\\(start|stop)paragraph/d'
Interested in more feedback on this from ConTeXt users...
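To make the suggestion concrete, here is a minimal sketch of that filter applied to a hand-written sample. The function name strip_paragraph_tags and the sample text are made up for illustration, and the sketch assumes the writer emits each \startparagraph and \stopparagraph on a line of its own; any line that merely contains one of those commands would be deleted as well.

```shell
# Hypothetical helper (the name is an assumption): drop the paragraph
# wrapper lines from ConTeXt source read on stdin.
# Assumes each \startparagraph / \stopparagraph sits on its own line.
strip_paragraph_tags() {
  sed -E -e '/\\(start|stop)paragraph/d'
}

# Hand-written sample input, not actual pandoc output.
printf '%s\n' \
  '\startparagraph' \
  'First paragraph.' \
  '\stopparagraph' \
  '' \
  '\startparagraph' \
  'Second paragraph.' \
  '\stopparagraph' | strip_paragraph_tags
```

The blank line between the two blocks survives, so the paragraphs remain separated in the stripped source.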
Actually, the ConTeXt writer has access to variables, so why don't we just activate this feature if the pdfa variable is set? Would that be sensible?
Checking the pdfa variable would make sense, IMHO.
It seems that there are a number of additional cases where we could improve tagging, e.g., in lists or for emphasized text: the ConTeXt wiki recommends defining \definehighlight[emph][style={\em}] and using \emph{text} instead of the normal {\em text}, as the former produces better tagging. The Export page in the wiki has a couple of additional examples. The end result looks quite different from "normal" ConTeXt, so yet another extension would be justifiable, too.
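As a rough illustration of the emphasis case only, the rewrite can be approximated by post-processing existing ConTeXt output. This is a sketch, not what the writer does: the function name use_emph_highlight is made up, the pattern only handles {\em ...} groups without nested braces, and the definition line is simply printed before the body rather than placed in a template.

```shell
# Hypothetical post-processing sketch (the function name is an assumption).
# Prints the highlight definition, then rewrites {\em text} to \emph{text}
# in the ConTeXt source read on stdin.
# Limitation: groups containing nested braces are left untouched.
use_emph_highlight() {
  printf '%s\n' '\definehighlight[emph][style={\em}]'
  sed -E 's/\{\\em ([^{}]*)\}/\\emph{\1}/g'
}

# Hand-written sample input.
printf '%s\n' 'A {\em very} important point.' | use_emph_highlight
```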
Checking the pdfa variable is easy but slightly unprincipled. (Variables are supposed to be for template inclusion, so it's always a bit odd when they affect the body too.) So maybe adding a tagging extension would make sense. Not sure.
I'm tempted to leave things as they are, but to use tagging as motivation for the new writer style and make_variant function that you suggested. Tagging-friendly ConTeXt would be a prime use case for this.
I just found out about the effect of --section-divs on ConTeXt output (#2609). I think it might make sense to hide the suggested behavior behind that switch.
I think having a separate tagging extension might be more principled. It wouldn't really be obvious why --section-divs ALSO puts tags around paragraphs.
Or maybe it wouldn't be that bad just to do the paragraph tagging by default for ConTeXt? It's the way of the future, presumably.
I see, that's true. If we merge this, would it make sense to let the --section-divs behavior be the default? It seems like that would be the most consistent.
If we merge this, would it make sense to let the --section-divs behavior be the default?
Agreed. I guess that would mean that we're only targeting ConTeXt MkIV, since older versions don't support \startsection/\stopsection. But at this point that's probably quite sensible. I think I'd be in favor of the simplest solution, and this is probably it.
The question @denismaier raised above is about the increased verbosity. I don't know how much of an issue that is for ConTeXt users.
I was informed on the ConTeXt mailing list that using \startparagraph ... \stopparagraph leads to problems in some cases, e.g. in list items. The workaround is to use \bpar ... \epar instead. I don't understand yet whether it's preferable to always use those commands, or to use them only where \startparagraph would lead to unexpected results.
I went ahead with the additional extension: if tagging is enabled, all paragraphs are wrapped in \bpar/\epar commands. Furthermore, we then generate \definehighlight commands for all used emphasis types and inject them via the emphasis-commands template variable.
Docs haven't been updated yet.