pandoc Accessibility mode for LaTeX

PDFs produced using LaTeX are not accessible. We could introduce a command-line option that causes the LaTeX writer to include annotations for math (perhaps using the Unicode fallback or even raw TeX), image alt text, and more, using the accsupp package: http://ctan.math.washington.edu/tex-archive/macros/latex/contrib/oberdiek/accsupp.pdf

This package also includes an option that makes spaces visible to copy and paste (often when you copy from a latex-compiled PDF, spaces disappear).

Structural elements (paragraphs, lists, etc.) need to be tagged, and reading order indicated.

See also: https://www.tug.org/twg/accessibility/ http://web.science.mq.edu.au/~ross/TaggedPDF/ https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex (with information about using ConTeXt)

Mar 29 '19 21:03 jgm

Note that tagged PDFs are starting to be required at conferences such as SIG Access and ICAD. Governments around the world, including the United States, Ontario, Australia, and the European Union, require at minimum that all government PDFs be properly tagged. This means any university receiving government money in the U.S. needs to have all their content be WCAG compliant. It also means that if Pandoc has no way to produce properly tagged PDFs, it will not be legally usable by any institution that falls under the above mandates. I would rate this as an extremely high priority: the U.S. started requiring accessible PDFs from all government bodies and entities receiving government money in January 2018, and the EU started requiring government sector websites to publish only accessible PDFs on September 23, 2018. So millions of PDFs are affected by these requirements.

Apr 01 '19 19:04 frastlin

Agreed, it's an important issue. It also comes up for materials distributed in connection with courses. I'm motivated to make it easier to produce accessible PDFs using pandoc, but I need some guidance on the LaTeX side.

Apr 01 '19 19:04 jgm

Note that I've found none of the PDFs produced by Pandoc to be accessible, even from the HTML to PDF engines. Apparently PDFLib produces tagged PDFs, but that is it. The only way I have found to create accessible PDFs from Pandoc is to use Microsoft Word or Open Office to generate the accessible PDF.

Apr 01 '19 19:04 frastlin

Yes, if your university receives government money, or has an internal mandate to be accessible, you're required to have accessible content. There are 2 options with Pandoc:

Produce HTML or Epub, which are accessible (with proper formatting) right out of the box with Pandoc.
Use Word or Open Office (make sure "Tagged PDFs" is checked).

Apr 01 '19 19:04 frastlin

btw. PDF/A was already brought up once, and implemented using the context writer, see https://github.com/jgm/pandoc/issues/3215

Apr 02 '19 07:04 mb21

There's no question that this is important, but it needs more support to complete the LaTeX implementation; @u-fischer has been doing some great work with https://ctan.org/pkg/tagpdf.

Apr 02 '19 15:04 adunning

Relevant pandoc-discuss thread

Apr 02 '19 15:04 jgm

Just received this information from another Pandoc user on accessibility-meta.sty:

Revisiting producing structured PDF from LaTeX (2015) -- provides some useful tips on creating hopefully accessible PDFs with accessibility-meta.sty. There is a link to GitHub but it no longer works; I will provide a working one below the following Stack Exchange link. If you are using Firefox you can cut out all of the clutter by pressing either F9 or Control+Alt+R depending on whether you are on Windows or Linux. If you are on a Mac I seem to remember the command being Command + Shift + R. I suspect you already know this though. :) https://tex.stackexchange.com/questions/124291/revisiting-producing-structured-pdfs-from-latex

Andy Clifton's Github repo for accessibility-meta.sty is: https://github.com/AndyClifton/AccessibleMetaClass

He calls this meta-class now so things may have changed somewhat. I should warn you that the most recent commit appears to be from 2 years ago. I should also say that I have not tried this myself recently.

Apr 04 '19 00:04 frastlin

I've tried using accessibility-meta.sty but without any success.

Apr 04 '19 16:04 jgm

@frastlin accessibility-meta doesn't work with luatex, and with pdflatex you need to set all page breaks manually, which is quite a problem. Also it isn't really extensible, e.g. to specific journal classes.

Apr 04 '19 19:04 u-fischer

Good to know. So it is looking like tagpdf is the best option for now. I would be more than happy to beta test the UX of tagged PDFs from Pandoc using tagpdf from a screen reader's perspective. I know almost nothing about LaTeX, so any testing I'll do will be from Markdown or HTML.

Apr 04 '19 19:04 frastlin

@frastlin I would be grateful if you could check the documentation (http://mirrors.ctan.org/macros/latex/contrib/tagpdf/tagpdf.pdf) and give some feedback. (I know that it has issues, but I find it difficult to judge how serious they are.)

Apr 04 '19 21:04 u-fischer

I opened it and here are my comments:

when I opened it, the first message I got was: "Cannot extract the embedded font 'OZCXQN+LMSans10-Bold'. Some characters may not display or print correctly."
Acrobat does not ask me how to read the document, so first check passed.
Love the headings!
In the table of contents, the 1 doesn't have a link when all the other numbers do. I'm not sure why the numbers carry the links when the name of the heading is the label. Normally, in manuals, Word tables of contents, and Pandoc tables of contents, the whole name of the heading is the link. When it is just the number, it's not always clear if the number is before or after the label, so I would much prefer the whole name be the link. I would also like the table of contents to be in a list. Here is what I see now:

1.
Introduction
link 2
1.1.
Tagging and accessibility...............................
link 3
1.2.
Engines and modes..................................
link 3
1.3.
References.......................................

(I added "Link" before the linked items). Here is what I would like:

List with 4 items
link 1. Introduction
link 1.1. Tagging and accessibility...............................
link 1.2. Engines and modes..................................
Link 1.3. References.......................................

Also note that links don't do anything when clicked.
I'm not seeing links for references. I see the [1], but it's not a link.
The list at 1.4. Validation has the • on another line than the text, so it looks like:

•
One must check that the pdf is syntactically correct. It is rather easy to create broken pdf: e.g. if a chunk is opened on one page but closed on the next page.

Rather than:

• One must check that the pdf is syntactically correct. It is rather easy to create broken pdf: e.g. if a chunk is opened on one page but closed on the next page.

2.2. Setup and activation has a list that has no bullets, dashes, or numbers to differentiate the list items, but I can tell it's a list with 15 items.
I like the alt text: "PAC3 report" which is the first graphic.

This is very good, and I would use it today if I could! I would like to test tables if you could give me a document with tables.

Apr 05 '19 06:04 frastlin

@frastlin thank you very much for the comments. I copied them to https://github.com/u-fischer/tagpdf/issues/15 and commented there as this is not really a pandoc question.

Apr 05 '19 08:04 u-fischer

I have experience making PDFs readable by the computer voice so feel free to contact me. Here is what I posted on the pandoc mailing list:

In 2012 I worked with US public school student standardized tests in PDF format that had to be read by the computer voice for people with visual disabilities. Large-print PDFs were not enough for them. I can't remember the requirement law for US states that required this but it was a requirement for every US public school. What we discovered was all the text in a PDF is in a random order when you look at the actual internal structure of the PDF. So the computer read the text in a random order. I don't think this has changed in the PDF internal structure. What that meant for US states is we had to manually reorder every word in the PDF by hand which was enormously expensive and time-consuming.

I'm not sure what program made the PDF, all we received was the PDF to work with. It could have been from Quark as Quark is infamous for putting elements in random order when you export to a text file or Excel file. Maybe a PDF made from MS Word would be in a better order.

If you use Quark, you are severely limited in what you can do with that data later. If you want to export it to a text file and do something with the data, a lot of time and expense will go into cleaning the data up first and putting it in a proper order and consistent form. (My daily paid job is processing text files from various applications.)

Jun 20 '19 11:06 bulrush15

If you use --pdf-engine=context, a tagged PDF is produced by default. Moreover, the pdfa option produces PDF/A-1b by default; if the format=PDF/A-1b:2005 option passed to \setupbackend in the ConTeXt template is changed to e.g. PDF/A-2a, a PDF/A-2a file (whose requirements include tagging) is produced instead. I have succeeded in validating files produced this way against the PDF/A-2a profile in veraPDF (the EU Preforma Project standard validator).

Jun 21 '19 14:06 klpn

@klpn If we produce a PDF/A-2a will the words be read by the computer in the proper order?

Jun 21 '19 16:06 bulrush15

A quick way to see the order in which the computer reads the text is to select all and paste the output into a text file. The more difficult tags, like heading, link, and table, need a viewer to check. But for headings, the text should be on its own line, similar to what you get if you paste the following content into a text editor:

Test Heading 1

This text will be on the line below the heading if you paste it into a text editor. If you have a PDF that is not tagged, and you don't have a program that can view the tags, then the heading will be on the same line as this text.
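
The check described above can be roughly automated once the text has been pasted into a file. The following is a minimal Lua sketch, not a validated accessibility test; the function name headingOnOwnLine and the file name pasted.txt are made up for illustration.

```lua
-- Return true if `heading` occupies a whole line of `text`
-- (leading and trailing whitespace on the line is ignored).
-- headingOnOwnLine is a hypothetical helper name, not part of any tool.
function headingOnOwnLine(text, heading)
  -- Append a newline so the last line is also matched.
  for line in (text .. "\n"):gmatch("(.-)\r?\n") do
    if line:match("^%s*(.-)%s*$") == heading then
      return true
    end
  end
  return false
end

-- Example use, assuming the pasted text was saved as pasted.txt:
--   local f = assert(io.open("pasted.txt", "r"))
--   local text = f:read("*a")
--   f:close()
--   print(headingOnOwnLine(text, "Test Heading 1"))
```

This only looks at line breaks in the pasted text, so it says nothing about link or table tags, which still need a viewer.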

Jun 21 '19 16:06 frastlin

The text is in correct order for the files I have tested.

Jun 21 '19 19:06 klpn

The tags, with their textual content, can be inspected e.g. with the Poppler pdfinfo program, like pdfinfo -struct-text [pdffile]. The default ConTeXt template in Pandoc 2.7.2 seems to destroy word boundaries in this output. I changed it according to the ConTeXt wiki (gist with diff), which solves this problem for the files I have tested.

Jun 22 '19 06:06 klpn

@klpn Any downsides when using your gist? If not, would you like to make a pull request? For context: the pdfa template variable was added in https://github.com/jgm/pandoc/commit/46f4238a2a40b5542612bc745e63ce503ce12a32

Jun 22 '19 07:06 mb21

I have not discovered any problems, but I should perhaps test with some more documents. However, the Pandoc manual explicitly states that the pdfa variable "adds to the preamble the setup necessary to generate PDF/A-1b:2005", so this should then be changed as well, if we want to always use 2a (i.e. version 2, level A conformance). When using PDF/A for documents born digital, it is best to use level A (which includes Unicode mapping and tagging) if possible, rather than B, but some older preservation guidelines still require version 1. Perhaps the pdfa variable should be changed so that the user can choose which version of PDF/A to use (from the different PDF formats supported by ConTeXt)?

Jun 22 '19 07:06 klpn

@klpn I've created a new issue about the ConTeXt output: https://github.com/jgm/pandoc/issues/5608 Let's continue the discussion there in order to not spam this issue (which is about LaTeX output).

Jun 22 '19 08:06 mb21

The main disadvantage with the ConTeXt solution, I think, is that there is a lot of functionality implemented in LaTeX (e.g. beamer) where a ConTeXt reimplementation would be cumbersome. The tagpdf package, which has been mentioned, could be used to tag LaTeX documents. It does not add tags automatically, however. Perhaps tags can be injected in the Pandoc AST, to create a structure like that shown in the tagpdf manual, sec. 3.5. I guess this would be hard to do using filters, and would rather require changes in the LaTeX writer?

Feb 11 '20 21:02 klpn

I think it would be possible (and not too hard) to add these tags using a lua filter. I don't see anything that would require changes to the writer.

For example, to get

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend

we'd have a filter like (untested)

-- Wrap a block element in raw LaTeX tagpdf commands for the given tag.
function tagBlock(label, el)
  return { pandoc.RawBlock("latex", "\\tagstructbegin{tag=" .. label ..
                  "}\n\\tagmcbegin{tag=" .. label .. "}"),
           el,
           pandoc.RawBlock("latex", "\\tagmcend\n\\tagstructend") }
end

function Header(el)
  return tagBlock("H", el)
end

And of course you can use tagBlock for other block-level elements too. To get the Sect tags you'd use mkSections first to get section Divs.
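
As a rough sketch of that idea, the same helper can be reused for paragraphs. This is untested against tagpdf; the tag names H and P are PDF standard structure types, and the helper names tagOpen and tagClose are made up for illustration.

```lua
-- Build the opening tagpdf commands for a structure tag such as "H" or "P".
-- tagOpen and tagClose are hypothetical helper names.
function tagOpen(label)
  return "\\tagstructbegin{tag=" .. label .. "}\n" ..
         "\\tagmcbegin{tag=" .. label .. "}"
end

-- Build the matching closing commands.
function tagClose()
  return "\\tagmcend\n\\tagstructend"
end

-- Replace a block with: raw opening commands, the block, raw closing commands.
function tagBlock(label, el)
  return {
    pandoc.RawBlock("latex", tagOpen(label)),
    el,
    pandoc.RawBlock("latex", tagClose()),
  }
end

-- pandoc calls these once per header and per paragraph.
function Header(el)
  return tagBlock("H", el)
end

function Para(el)
  return tagBlock("P", el)
end
```

Saved as, say, tag.lua (a hypothetical file name), it would be applied with pandoc --lua-filter tag.lua. Whether the resulting PDF is valid still has to be checked with a validator.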

Feb 12 '20 05:02 jgm

Thanks, I will experiment a bit more with Lua filters and see if I can get accessible PDFs. Once we have properly tagged PDFs, it should also be possible to get PDF/A Level A from LaTeX via the pdfx package.

Feb 12 '20 07:02 klpn

A problem with a solution like that proposed by @jgm is for Beamer slides. This

# Pixedit

* Converts Office files

yields

\begin{frame}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\end{frame}

\begin{frame}{Pixedit}
\protect\hypertarget{pixedit}{}
\tagmcend
\tagstructend

\begin{itemize}
\tightlist
\item
  Converts Office files

The initial tagging commands before the header are placed in an empty frame, due to the way the writer divides frames from the Header structure when using beamer as output format.

Feb 12 '20 08:02 klpn

A few warnings ...

The tagpdf package, which has been mentioned, could be used to tag LaTeX documents. It does not add tags automatically, however.

Yes. This is explicitly not the purpose of the package. It is not a standard user package. The package has been written to give us (the latex team) and others a tool to investigate and experiment with tagging and to find out which changes in latex are needed.

I don't mind if you try to use it (actually I'm grateful for feedback) but the package is experimental and it is bound to change. For example in the development branch the internal module name has already been changed, for the handling of pdf internals another experimental package is now needed, the handling of artifacts will probably change.

You can get broken pdf if you don't use it correctly (and sometimes if you don't compile often enough to resolve all references). So you need tools to check the validity of the pdf.

For example, to get

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend

Such simple code will normally not work with pdflatex, as there can be page breaks in the wrong place, resulting in broken pdf. With lualatex it is less problematic.

Feb 12 '20 08:02 u-fischer

@u-fischer thanks for the note. Does it work to use etoolbox's \apptocmd and \pretocmd to attach these things?

\pretocmd{\section}{\tagstructbegin{tag=H}\tagmcbegin{tag=H}}{}{}
\apptocmd{\section}{\tagmcend\tagstructend}{}{}

This could go in the preamble and then the body would not need to change. It seems to me that a similar approach could be used to tag lots of other things, or am I missing something?

Feb 12 '20 16:02 jgm

@jgm that wouldn't change much (apart from saving the user some typing). The commands are still issued in vertical mode before and after the sectioning. For example if I compile this with pdflatex:

\documentclass{article}
\usepackage{tagpdf}
\tagpdfsetup{activate-all}
\begin{document}
some text 

\section{Section}
text after

\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after

\vspace{33\baselineskip}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after\\text \\text

\vspace{42\baselineskip}
\tagstructbegin{tag=H}
\tagmcbegin{tag=H}
\section{Section}
\tagmcend
\tagstructend
text after\\text \\text

\end{document}

then I get various problems: wrong spacing after the sections with tagging commands; a page break after the third section and before the following text; and, because of the last section, preflight reports wrong operators and a faulty pdf.

With lualatex the results are better: the pdf is valid and there is no page break after the section.

The side effects mean that one has to inject the commands into the internal \@startsection instead. And that is what makes the business so complicated: lots of internal code has to be reviewed and reworked to find suitable places for the tagging, at best without breaking existing documents.

Feb 12 '20 18:02 u-fischer