publ-a11y Inferring accessibility metadata when none is present

On the October 26, 2023 call, the question came up of what to do when little or no accessibility metadata is present. Should this group provide guidance about inferring metadata when it is not present? In the past we have said that a reflowable EPUB with a detailed nav doc is normally very accessible. It is also possible to examine the EPUB for accessibility features. So the issue is: what guidance should we provide about a distributor, for example, adding accessibility metadata to their catalogue that can be inferred by examining the title?

Oct 27 '23 15:10 GeorgeKerscher

Hadrien Gardeur of De Marque gave a presentation about the lack of accessibility metadata at the EDItEUR Supply Chain Conference at the Frankfurt Book Fair, and the slides are available to all on the EDItEUR website here: https://editeur.org/3/Events/Event-Details/667

Oct 30 '23 10:10 chrisONIX

The Readium Go toolkit's inferred metadata (work in progress) explores this path. We'll be happy to discuss the subject collectively.

Oct 31 '23 07:10 gautierchomel

I think this is an important issue. I see different organizations moving in that direction, with the risk of different interpretations of how to infer metadata. I think joint work will be needed to define high-level guidelines on how to analyze code to extract metadata consistently across implementations.

In terms of UX guidelines, we will have to consider whether to indicate to the end user if a piece of metadata comes from the content creator or from an inference algorithm. If we do add this information, we also need to decide at what level of granularity.
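
One way to keep that choice open is to record the origin of each value at the point where declared and inferred metadata are merged, so a UI can later decide how much of it to show. A minimal sketch, assuming metadata is held as plain dictionaries of property name to list of values; `TaggedValue`, `merge_with_provenance` and the `"declared"` / `"inferred"` labels are hypothetical names, not part of any existing vocabulary or toolkit:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaggedValue:
    """A metadata value plus where it came from (hypothetical structure)."""
    value: str
    source: str  # "declared" (content creator) or "inferred" (algorithm)


def merge_with_provenance(
    declared: Dict[str, List[str]],
    inferred: Dict[str, List[str]],
) -> Dict[str, List[TaggedValue]]:
    """Merge per-property values, letting declared values win over inferred ones."""
    merged: Dict[str, List[TaggedValue]] = {}
    for prop in sorted(set(declared) | set(inferred)):
        declared_values = declared.get(prop, [])
        values = [TaggedValue(v, "declared") for v in declared_values]
        seen = set(declared_values)
        for v in inferred.get(prop, []):
            # Only add an inferred value if the creator did not already state it.
            if v not in seen:
                values.append(TaggedValue(v, "inferred"))
                seen.add(v)
        merged[prop] = values
    return merged


merged = merge_with_provenance(
    {"accessibilityFeature": ["tableOfContents"]},
    {"accessibilityFeature": ["tableOfContents", "printPageNumbers"],
     "accessMode": ["visual"]},
)
```

Tagging each value, rather than the whole record, is the finest granularity; a coarser display (for example one flag per publication) can always be derived from it.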

Oct 31 '23 07:10 gregoriopellegrino

Thank you for the link to the slides, @gautierchomel. Interesting to see what Readium has seen.

A look at our titles:

Total titles: 3,084,444
Titles that are EPUB: 1,649,667
EPUB titles with any a11y metadata claim: 260,272
EPUB titles with conformsTo: 27,280
EPUB titles with certifiedBy: 28,492
EPUB titles with MathML: 48,551
EPUB titles with displayTransformability: 145,518
EPUB titles with accessModeSufficient=textual: 113,338
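
For scale, these counts can be expressed as shares of the EPUB catalogue. A minimal sketch using only the figures listed above; the variable names and the `percent` helper are illustrative:

```python
total_titles = 3_084_444
epub_titles = 1_649_667

# Counts of EPUB titles carrying each kind of metadata, copied from the list above.
epub_counts = {
    "any a11y metadata claim": 260_272,
    "conformsTo": 27_280,
    "certifiedBy": 28_492,
    "MathML": 48_551,
    "displayTransformability": 145_518,
    "accessModeSufficient=textual": 113_338,
}


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


epub_share = percent(epub_titles, total_titles)
shares = {label: percent(count, epub_titles) for label, count in epub_counts.items()}

print(f"EPUB share of all titles: {epub_share:.1f}%")
for label, share in shares.items():
    print(f"{label}: {share:.1f}% of EPUB titles")
```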

Nov 02 '23 02:11 rickj

As mentioned by @chrisONIX, @gautierchomel and @gregoriopellegrino we've been working on various things over the last 18 months at De Marque:

unifying how we represent accessibility metadata in all our systems
extracting and inferring metadata from EPUB files (which we contribute back to the community through the Readium Web project and the rwp utility)
comparing this data to what we receive in ONIX

The data covered in my presentation in Frankfurt comes primarily from trade publishing, which is probably quite a different dataset from what @rickj has on his side.

The logic for our inference rules is entirely open source, but I can summarize it here:

for now, we focus on low-hanging fruit, with inferences that rely strictly on the OPF, Navigation Document and NCX
we prefer a conservative approach over inferring metadata more aggressively

Here's the list of current rules:

if the publication is a reflowable EPUB and either contains no image/audio/video or its only image is identified as a cover, we infer textual (on its own and not combined with other values) for accessModeSufficient
if the publication contains a table of contents in its Navigation Document or NCX, we infer tableOfContents for accessibilityFeature
if the publication contains a page list in its Navigation Document or NCX, we infer printPageNumbers for accessibilityFeature
if the publication contains MathML, we infer MathML for accessibilityFeature
if the publication contains SMIL files (media overlays), we infer synchronizedAudioText for accessibilityFeature
if the publication contains video or audio resources, we infer auditory for accessMode
if the publication contains images or video resources, we infer visual for accessMode
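
The rules above can be sketched roughly as follows. This is an illustration, not the actual open source implementation: `PublicationSummary` and `infer_accessibility` are hypothetical names, the flags stand in for whatever a parser extracts from the OPF, Navigation Document and NCX, and treating the cover image as enough to infer visual is an assumption about how the last rule is read.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PublicationSummary:
    """Hypothetical summary of what was found in the OPF, nav doc and NCX."""
    is_reflowable: bool = False
    has_cover_image: bool = False
    has_other_images: bool = False  # any image not identified as the cover
    has_audio: bool = False
    has_video: bool = False
    has_toc: bool = False
    has_page_list: bool = False
    has_mathml: bool = False
    has_media_overlays: bool = False  # SMIL files


def infer_accessibility(pub: PublicationSummary) -> Dict[str, List]:
    inferred: Dict[str, List] = {
        "accessMode": [],
        "accessModeSufficient": [],
        "accessibilityFeature": [],
    }

    # Rule 1: reflowable and nothing but text (a cover image is tolerated).
    text_only = not (pub.has_other_images or pub.has_audio or pub.has_video)
    if pub.is_reflowable and text_only:
        inferred["accessModeSufficient"].append(["textual"])

    # Rules 2 to 5: features detectable from navigation and the manifest.
    if pub.has_toc:
        inferred["accessibilityFeature"].append("tableOfContents")
    if pub.has_page_list:
        inferred["accessibilityFeature"].append("printPageNumbers")
    if pub.has_mathml:
        inferred["accessibilityFeature"].append("MathML")
    if pub.has_media_overlays:
        inferred["accessibilityFeature"].append("synchronizedAudioText")

    # Rules 6 and 7: access modes from the resource types present.
    if pub.has_audio or pub.has_video:
        inferred["accessMode"].append("auditory")
    # Assumption: the cover counts as an image for this rule.
    if pub.has_cover_image or pub.has_other_images or pub.has_video:
        inferred["accessMode"].append("visual")

    return inferred


novel = PublicationSummary(is_reflowable=True, has_cover_image=True, has_toc=True)
result = infer_accessibility(novel)
```

Note that nothing is inferred when a signal is absent: a missing table of contents does not produce a negative claim, which matches the conservative approach described earlier.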

For the table of contents and page list, this could be refined by:

looking at the number of items in either list
comparing the number of items in a page list with the number of pages documented in the EPUB metadata or with the number of positions (that we automatically calculate for every book)
comparing the number of items in a table of contents to the number of items in the reading order (spine)

That said, we've seen EPUBs that had good reasons for having only a smallish table of contents or a partial page list, so it's pretty hard to define a rule that works across all publications.
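
Given that difficulty, one option is to expose the comparisons as raw ratios and leave any threshold to the consumer. A minimal sketch under that assumption; `coverage_ratio` and `navigation_signals` are hypothetical helpers, and `expected_pages` stands for either the page count documented in the EPUB metadata or a calculated number of positions:

```python
from typing import Dict, Optional


def coverage_ratio(listed_items: int, expected_items: Optional[int]) -> Optional[float]:
    """Ratio of listed entries to expected entries, or None if unknown."""
    if expected_items is None or expected_items <= 0:
        return None
    return listed_items / expected_items


def navigation_signals(
    toc_items: int,
    spine_items: int,
    page_list_items: int,
    expected_pages: Optional[int],
) -> Dict[str, Optional[float]]:
    """Report how complete the TOC and page list look, without judging them."""
    return {
        # Table of contents entries compared with reading order (spine) items.
        "toc_to_spine": coverage_ratio(toc_items, spine_items),
        # Page list entries compared with the documented or calculated page count.
        "page_list_to_pages": coverage_ratio(page_list_items, expected_pages),
    }


signals = navigation_signals(toc_items=10, spine_items=40,
                             page_list_items=300, expected_pages=300)
```

A low ratio here is only a hint to look closer: as noted above, a short table of contents or a partial page list can be entirely legitimate.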

Nov 02 '23 11:11 HadrienGardeur
