PDF/UA accessibility: strange tagging
Hello, I'm trying to generate a PDF/UA document from HTML, but the resulting tagging looks strange. Is this tagging correct? Is there any way to modify it? This is my first time using your library, and I am very interested in the accessibility part.
I am using Adobe Acrobat Pro to inspect the tags.
Thank you in advance.
HTML:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>Ejemplo PDF</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 50px;
}
h1 {
color: pink;
}
</style>
</head>
<body>
<h1>Hello World</h1>
<p>Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, eros blandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverra mattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempus vehicula, dignissim congue netus odio potenti phasellus malesuada sodales habitant, egestas id imperdiet sociis vitae taciti curabitur.</p>
</body>
</html>
Python (Flask):
from flask import Flask, render_template, make_response
from weasyprint import HTML

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/pdf')
def generate_pdf():
    HTML('./templates/index.html').write_pdf('test_pdf_ua.pdf', pdf_version="1.6", pdf_variant='pdf/ua-1')
    return "PDF Generated"

if __name__ == '__main__':
    app.run(debug=True)
Result:
Expected result:
Hmmm… There’s something strange in these tags. We have to check what’s wrong and try to improve this structure.
Here’s a PDF with new labels, could you please test how it works in your PDF reader?
(By the way, what’s your PDF reader?)
I too wondered about this NonStruct in the structure tree generated by WeasyPrint. It is in the standard (ISO 32000-1:2008, page 584):
NonStruct (Nonstructural element) A grouping element having no inherent structural significance; it serves solely for grouping purposes. This type of element differs from a division (structure type Div) in that it shall not be interpreted or exported to other document formats; however, its descendants shall be processed normally.
But I am not really sure why it needs to be used in this case. It appears WeasyPrint is treating <html> and <body> as "nonstructural elements" wrapped around the text content. It could probably just not do that, but on the other hand it isn't necessarily incorrect to do so (after all, they are grouping elements having no inherent structural significance), just unexpected, since other PDF/UA tools (like Microsoft Word) don't do it.
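To illustrate what "its descendants shall be processed normally" means for a consumer, here is a minimal sketch that drops NonStruct wrappers from a structure tree and keeps their children in place. The dict layout (`type`, `children`, `text`) imitates the JSON that pdfplumber prints, the function name `strip_nonstruct` is made up for this example, and the sample tree with two nested NonStruct elements is my assumption about what WeasyPrint emits for <html> and <body>, not actual output.

```python
def strip_nonstruct(node):
    """Return a list of nodes in which NonStruct wrappers are replaced by their children."""
    children = []
    for child in node.get("children", []):
        children.extend(strip_nonstruct(child))
    if node["type"] == "NonStruct":
        # The wrapper itself is not interpreted; its descendants move up one level.
        return children
    result = {key: value for key, value in node.items() if key != "children"}
    if children:
        result["children"] = children
    return [result]


# Assumed shape: <html> and <body> each become one NonStruct wrapper.
tree = {"type": "Document", "children": [
    {"type": "NonStruct", "children": [
        {"type": "NonStruct", "children": [
            {"type": "H1", "text": ["Hello World"]},
            {"type": "P", "text": ["Lorem ipsum"]},
        ]},
    ]},
]}

flattened = strip_nonstruct(tree)
for element in flattened[0]["children"]:
    print(element["type"], element["text"])
```

After flattening, H1 and P sit directly under Document, which is roughly the structure other PDF/UA producers write.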
The above ua.pdf is definitely not correct though. pdfinfo -struct won't even read it:
$ pdfinfo -struct ~/Downloads/ua.pdf
Syntax Error: StructElem object is wrong type (None)
Syntax Error: StructElem object is wrong type (None)
Document
You can also look at structure trees with pdfplumber --structure-text (I'm biased, because I contributed this functionality), which outputs JSON and tries to be tolerant of invalid structure trees (of which there are many):
[{"type": "Document", "children": [
{"type": "None", "page_number": 1, "children": [
{"type": "None", "page_number": 1, "children": [
{"type": "H1", "page_number": 1, "mcids": [0], "text": ["Hello World"]},
{"type": "P", "page_number": 1, "mcids": [1], "text": ["Lorem ipsum dolor sit amet consectetur adipiscing elit pellentesque, erosblandit porttitor primis mollis nisi in nunc, ante interdum vestibulum viverramattis et sociosqu. Faucibus a risus laoreet posuere placerat class tempusvehicula, dignissim congue netus odio potenti phasellus malesuada sodaleshabitant, egestas id imperdiet sociis vitae taciti curabitur."]}]}]}]}]
Clearly None is not in the standard as a structure element ;-)
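If you want to check for this kind of thing automatically, a rough sketch like the following walks the JSON and reports every element whose type is not one of the standard structure types. The `STANDARD_TYPES` set is my own transcription of the types listed in ISO 32000-1 section 14.8.4, and `find_nonstandard` is a made-up helper, not part of pdfplumber. It also ignores the RoleMap, so a document that legitimately maps custom types to standard ones would be flagged too.

```python
import json

# Standard structure types, transcribed by hand from ISO 32000-1 section 14.8.4.
STANDARD_TYPES = {
    "Document", "Part", "Art", "Sect", "Div", "BlockQuote", "Caption",
    "TOC", "TOCI", "Index", "NonStruct", "Private",
    "P", "H", "H1", "H2", "H3", "H4", "H5", "H6",
    "L", "LI", "Lbl", "LBody",
    "Table", "TR", "TH", "TD", "THead", "TBody", "TFoot",
    "Span", "Quote", "Note", "Reference", "BibEntry", "Code", "Link", "Annot",
    "Ruby", "RB", "RT", "RP", "Warichu", "WT", "WP",
    "Figure", "Formula", "Form",
}


def find_nonstandard(nodes, path=()):
    """Return slash-separated paths of all elements with a non-standard type."""
    problems = []
    for node in nodes:
        here = path + (node["type"],)
        if node["type"] not in STANDARD_TYPES:
            problems.append("/".join(here))
        problems.extend(find_nonstandard(node.get("children", []), here))
    return problems


# Abbreviated copy of the JSON shown above.
sample = json.loads("""
[{"type": "Document", "children": [
  {"type": "None", "page_number": 1, "children": [
    {"type": "None", "page_number": 1, "children": [
      {"type": "H1", "page_number": 1, "mcids": [0], "text": ["Hello World"]}]}]}]}]
""")

problems = find_nonstandard(sample)
print(problems)  # ['Document/None', 'Document/None/None']
```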
If you’re interested in this issue, it’s time to test #2471! Feedback will be highly appreciated, even to say that it just works. 🙏