publ-a11y
Inferring accessibility metadata when none is present
On the October 26, 2023 call, the question came up of what to do when little or no accessibility metadata is present. Should this group provide guidance about inferring metadata when it is not present? In the past we have said that a reflowable EPUB with a detailed nav doc is normally very accessible. It is also possible to examine the EPUB for accessibility features. So the issue is: what guidance should we provide to, for example, a distributor that adds accessibility metadata to its catalogue when that metadata can be inferred by examining the title?
Hadrien Gardeur of De Marque gave a presentation about the lack of accessibility metadata at the EDItEUR Supply Chain Conference at the Frankfurt Book Fair, and the slides are available to all on the EDItEUR website: https://editeur.org/3/Events/Event-Details/667
The Readium Go toolkit's inferred metadata work (in progress) explores this path. We'll be happy to discuss the subject collectively.
I think this is an important issue. I see different organizations moving in that direction, with the risk of diverging interpretations of how to infer metadata. I think joint work will be needed to define high-level guidelines on how to analyze code to extract metadata consistently across implementations.
In terms of UX guidelines, we will have to consider whether to tell the end user if a piece of metadata comes from the content creator or from an inference algorithm, and if so, at what level of granularity.
Thank you for the link to the slides, @gautierchomel. Interesting to see what Readium has seen.
A look at our titles:
- Total titles: 3,084,444
- Titles that are EPUB: 1,649,667
- EPUB titles with any a11y metadata claim: 260,272
- EPUB titles with conformsTo: 27,280
- EPUB titles with certifiedBy: 28,492
- EPUB titles with MathML: 48,551
- EPUB titles with displayTransformability: 145,518
- EPUB titles with accessModeSufficient=textual: 113,338
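To put these counts in proportion, here is a minimal sketch that expresses each one as a share of the EPUB titles. The counts are copied from the list above; the variable and function names are only illustrative. By this arithmetic, roughly 16% of the EPUB titles carry any accessibility metadata claim.

```python
# Counts copied from the list above; names are illustrative only.
total_titles = 3_084_444
epub_titles = 1_649_667

epub_counts = {
    "any a11y metadata claim": 260_272,
    "conformsTo": 27_280,
    "certifiedBy": 28_492,
    "MathML": 48_551,
    "displayTransformability": 145_518,
    "accessModeSufficient=textual": 113_338,
}


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Share of all titles that are EPUB.
epub_share = percent(epub_titles, total_titles)

# Share of EPUB titles carrying each kind of metadata.
shares = {name: percent(count, epub_titles) for name, count in epub_counts.items()}

if __name__ == "__main__":
    print(f"EPUB share of all titles: {epub_share}%")
    for name, share in shares.items():
        print(f"{name}: {share}% of EPUB titles")
```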
As mentioned by @chrisONIX, @gautierchomel and @gregoriopellegrino, we've been working on various things over the last 18 months at De Marque:
- unifying how we represent accessibility metadata in all our systems
- extracting and inferring metadata from EPUB files (which we contribute back to the community through the Readium Web project and the rwp utility)
- comparing this data to what we receive in ONIX
The data covered in my presentation in Frankfurt comes primarily from trade publishing, which is probably quite a different dataset from what @rickj has on his side.
The logic for our inference rules is entirely open source, but I can summarize it here:
- for now, we focus on low-hanging fruit, with inferences that rely strictly on the OPF, Navigation Document and NCX
- we prefer a conservative approach to inferring metadata more aggressively
Here's the list of current rules:
- if the publication is a reflowable EPUB and either contains no image, audio or video, or its only image is identified as a cover, we infer `textual` (on its own and not combined with other values) for `accessModeSufficient`
- if the publication contains a table of contents in its Navigation Document or NCX, we infer `tableOfContents` for `accessibilityFeature`
- if the publication contains a page list in its Navigation Document or NCX, we infer `printPageNumbers` for `accessibilityFeature`
- if the publication contains MathML, we infer `MathML` for `accessibilityFeature`
- if the publication contains SMIL files (media overlay), we infer `synchronizedAudioText` for `accessibilityFeature`
- if the publication contains video or audio resources, we infer `auditory` for `accessMode`
- if the publication contains images or video resources, we infer `visual` for `accessMode`
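As a rough illustration of how the rules above combine, here is a minimal Python sketch. It is not the Readium code: `ScanResult`, its fields and `infer_accessibility` are hypothetical names, and the sketch assumes the OPF, Navigation Document and NCX have already been scanned into simple flags. It also assumes that a cover image still counts as an image for `accessMode`, which the rules above do not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Hypothetical summary of a scan of the OPF, Navigation Document and NCX."""
    reflowable: bool = False
    has_cover_image: bool = False
    non_cover_images: int = 0
    has_audio: bool = False
    has_video: bool = False
    has_toc: bool = False
    has_page_list: bool = False
    has_mathml: bool = False
    has_media_overlay: bool = False  # SMIL files


def infer_accessibility(scan: ScanResult) -> dict:
    """Apply the conservative inference rules to a scan summary."""
    access_mode = []
    sufficient = []
    features = []

    # Assumption: a cover image still counts as an image for accessMode.
    has_images = scan.has_cover_image or scan.non_cover_images > 0

    if scan.has_audio or scan.has_video:
        access_mode.append("auditory")
    if has_images or scan.has_video:
        access_mode.append("visual")

    # "textual" alone is sufficient only for a reflowable EPUB with no
    # audio or video, and no image other than the cover.
    if (scan.reflowable and scan.non_cover_images == 0
            and not scan.has_audio and not scan.has_video):
        sufficient.append(["textual"])

    if scan.has_toc:
        features.append("tableOfContents")
    if scan.has_page_list:
        features.append("printPageNumbers")
    if scan.has_mathml:
        features.append("MathML")
    if scan.has_media_overlay:
        features.append("synchronizedAudioText")

    return {
        "accessMode": access_mode,
        "accessModeSufficient": sufficient,
        "accessibilityFeature": features,
    }


if __name__ == "__main__":
    novel = ScanResult(reflowable=True, has_cover_image=True, has_toc=True)
    print(infer_accessibility(novel))
```

Because every rule only adds a value when its condition holds, a publication that matches none of them simply gets empty lists, which fits the conservative approach described earlier.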
For the table of contents and page list, this could be refined by:
- looking at the number of items in either list
- comparing the number of items in a page list with the number of pages documented in the EPUB metadata or with the number of positions (that we automatically calculate for every book)
- comparing the number of items in a table of contents to the number of items in the reading order (spine)
That said, we've seen EPUBs that had good reasons for having only a smallish table of contents or a partial page list, so it's pretty hard to define a rule that works across all publications.
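One way to experiment with these refinements without committing to a hard rule is to compute the comparisons as plain ratios and leave the interpretation to whoever consumes them. The sketch below is hypothetical: `navigation_signals`, its parameters and the returned keys are illustrative names, not part of any Readium API, and it deliberately applies no threshold.

```python
from typing import Optional


def coverage_ratio(listed: int, expected: Optional[int]) -> Optional[float]:
    """Return listed / expected, or None when there is nothing to compare against."""
    if expected is None or expected <= 0:
        return None
    return listed / expected


def navigation_signals(
    toc_items: int,
    spine_items: int,
    page_list_items: int,
    declared_pages: Optional[int] = None,
    positions: Optional[int] = None,
) -> dict:
    """Collect raw signals about the table of contents and the page list.

    The page list is compared with the page count declared in the metadata
    when available, otherwise with the number of calculated positions.
    """
    page_reference = declared_pages if declared_pages else positions
    return {
        "tocItems": toc_items,
        "pageListItems": page_list_items,
        "tocToSpine": coverage_ratio(toc_items, spine_items),
        "pageListToPages": coverage_ratio(page_list_items, page_reference),
    }


if __name__ == "__main__":
    print(navigation_signals(toc_items=5, spine_items=20,
                             page_list_items=150, declared_pages=300))
```

Returning ratios rather than a pass/fail verdict leaves room for publications with a legitimately short table of contents or a partial page list.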