
PDF/UA accessibility. Labeled strange.

Open marina31714 opened this issue 1 year ago • 5 comments

Hello, I'm trying to generate a PDF/UA document from HTML, but the resulting tagging looks strange. Is this labeling correct? Is there any way to modify it? This is the first time I have used your library, and I am very interested in the accessibility part.

I am using Adobe Acrobat Pro to look at the labeling.

Thank you in advance.

HTML:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
    <title>Ejemplo PDF</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 50px;
        }
        h1 {
            color: pink;
        }
    </style>
</head>
<body>
    <h1>Hello World</h1>
    <p>Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, eros blandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverra mattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempus vehicula, dignissim congue netus odio potenti phasellus malesuada sodales habitant, egestas id imperdiet sociis vitae taciti curabitur.</p>
</body>
</html>

Python (Flask):

from flask import Flask, render_template, make_response
from weasyprint import HTML

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/pdf')
def generate_pdf():
    HTML('./templates/index.html').write_pdf('test_pdf_ua.pdf', pdf_version='1.6', pdf_variant='pdf/ua-1')
    return "PDF Generated"

if __name__ == '__main__':
    app.run(debug=True)

Result: image

Expected result: image

marina31714, May 09 '24 14:05

Hmmm… There’s something strange in these labels, we have to check what’s wrong and try to improve this structure.

liZe, May 29 '24 15:05

Here’s a PDF with new labels, could you please test how it works in your PDF reader?

ua.pdf

(By the way, what’s your PDF reader?)

liZe, Aug 13 '24 05:08

I too wondered about this NonStruct in the structure tree generated by WeasyPrint... It is in the standard (PDF 32000-1:2008 page 584):

NonStruct (Nonstructural element): A grouping element having no inherent structural significance; it serves solely for grouping purposes. This type of element differs from a division (structure type Div) in that it shall not be interpreted or exported to other document formats; however, its descendants shall be processed normally.

But I am not really sure why it needs to be used in this case - it appears WeasyPrint is treating <html> and <body> as "nonstructural elements" wrapped around the text content. It could probably just skip them - but on the other hand it isn't necessarily incorrect to emit them (after all, they are grouping elements having no inherent structural significance), just unexpected, since other PDF/UA tools (like Microsoft Word) don't do it.
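To make the "descendants shall be processed normally" rule concrete, here is a minimal sketch of how a consumer might ignore NonStruct wrappers while keeping their children. It assumes a tree of plain dicts with "type" and "children" keys (the shape pdfplumber's JSON uses); the function name `flatten_nonstruct` and the sample tree are hypothetical, written for illustration only, and are not actual WeasyPrint output.

```python
from typing import Any


def flatten_nonstruct(nodes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of the tree with NonStruct wrappers removed.

    The children of each NonStruct element are promoted to its parent,
    so the remaining elements are still visited in document order.
    """
    result = []
    for node in nodes:
        children = flatten_nonstruct(node.get("children", []))
        if node["type"] == "NonStruct":
            # Drop the wrapper itself, keep everything inside it.
            result.extend(children)
        else:
            copy = {key: value for key, value in node.items() if key != "children"}
            if children:
                copy["children"] = children
            result.append(copy)
    return result


# Illustrative tree only: two nested wrappers, as <html> and <body> might produce.
tree = [{"type": "Document", "children": [
    {"type": "NonStruct", "children": [
        {"type": "NonStruct", "children": [
            {"type": "H1", "text": ["Hello World"]},
            {"type": "P", "text": ["Lorem ipsum"]},
        ]},
    ]},
]}]

flat = flatten_nonstruct(tree)
print([child["type"] for child in flat[0]["children"]])  # ['H1', 'P']
```

After flattening, H1 and P sit directly under Document, which is roughly the shape one would expect from other PDF/UA producers.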

The above ua.pdf is definitely not correct though. pdfinfo -struct won't even read it:

$ pdfinfo -struct ~/Downloads/ua.pdf
Syntax Error: StructElem object is wrong type (None)
Syntax Error: StructElem object is wrong type (None)
Document

You can also look at structure trees with (I'm biased because I contributed this functionality) pdfplumber --structure-text which gives JSON and tries to be tolerant of invalid structure trees (of which there are many):

[{"type": "Document", "children": [
  {"type": "None", "page_number": 1, "children": [
    {"type": "None", "page_number": 1, "children": [
      {"type": "H1", "page_number": 1, "mcids": [0], "text": ["Hello World"]},
      {"type": "P", "page_number": 1, "mcids": [1], "text": ["Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, erosblandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverramattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempusvehicula, dignissim congue netus odio potenti phasellus malesuada sodaleshabitant, egestas id imperdiet sociis vitae taciti curabitur."]}]}]}]}]

Clearly None is not in the standard as a structure element ;-)
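A check like this can be automated. The sketch below walks JSON in the shape shown above and reports every element whose type is not one of the standard structure types listed in PDF 32000-1:2008 section 14.8.4. The helper name `find_nonstandard` is hypothetical, and the check deliberately ignores role maps, which may legitimately map custom type names onto standard ones, so treat a hit as something to investigate rather than proof of an invalid file.

```python
import json

# Standard structure types from PDF 32000-1:2008, section 14.8.4.
STANDARD_TYPES = {
    # Grouping elements
    "Document", "Part", "Art", "Sect", "Div", "BlockQuote", "Caption",
    "TOC", "TOCI", "Index", "NonStruct", "Private",
    # Block-level structure elements
    "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD", "THead", "TBody", "TFoot",
    # Inline-level structure elements
    "Span", "Quote", "Note", "Reference", "BibEntry", "Code", "Link",
    "Annot", "Ruby", "RB", "RT", "RP", "Warichu", "WT", "WP",
    # Illustration elements
    "Figure", "Formula", "Form",
}


def find_nonstandard(nodes, path=""):
    """Return slash-separated paths of elements with non-standard types."""
    found = []
    for node in nodes:
        here = f"{path}/{node['type']}"
        if node["type"] not in STANDARD_TYPES:
            found.append(here)
        found.extend(find_nonstandard(node.get("children", []), here))
    return found


# Abbreviated version of the output above (text content shortened).
sample = json.loads("""
[{"type": "Document", "children": [
  {"type": "None", "page_number": 1, "children": [
    {"type": "None", "page_number": 1, "children": [
      {"type": "H1", "page_number": 1, "mcids": [0], "text": ["Hello World"]},
      {"type": "P", "page_number": 1, "mcids": [1], "text": ["Lorem ipsum"]}]}]}]}]
""")

print(find_nonstandard(sample))  # ['/Document/None', '/Document/None/None']
```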

dhdaines, Sep 10 '24 00:09

If you’re interested in this issue, it’s time to test #2471! Feedback will be highly appreciated, even to say that it just works. 🙏

liZe, Jun 06 '25 15:06

Thanks! This is timely since I'm doing a bunch of work on logical structure trees in PLAYA so I can use it to verify and vice versa.

dhdaines, Jun 11 '25 11:06