epubcheck
epubcheck copied to clipboard
Make sure Schematron checks normalize whitespace when needed
Spawned from this review comment:
It's actually a problem all over with attribute tests. XML processors will normalize whitespace in attributes, but there are lots of tests where we're not taking that into account. For example, id=" foo " is valid, but unique-identifier="foo" won't properly test it, and vice versa. In some cases, like the previous, it won't matter because an exact match is required, but overall the tests are a combination of being too brittle and too strict.
It was a long time ago when I was involved in the design of XML. But leading and trailing whitespace is dropped only when the DTD indicates that the attribute is not CDATA and the XML processor encounters the attribute declaration.
We should really say something about expectations when processing attributes. We define that whitespace must be trimmed from elements, but are silent about attributes.
The issue I'm finding is that it is not invalid to have leading/trailing space in attributes, but our tests don't account for it (schematron doesn't trim values when operating on them, or comparing them, unless it is explicitly called to do so by the test). So if we consider it as part of the value, then we have the opposite problem that in very few cases are we alerting users that they are including whitespace where they shouldn't.
Reading Systems MUST trim all leading and trailing white space from the element value, as defined by the XML specification [XML], before processing the value.
This sentence is misleading. XML defines whitespace, but it does not specify that whitespace in element contents be dropped.
@mattgarrish
I think that it is perfectly fine to have leading/trailing whitespace for attribute values. Schematron rules have to be fixed.
I think that it is perfectly fine to have leading/trailing whitespace for attribute values. Schematron rules have to be fixed.
Agree, and that's what I was proposing in the other thread. We should be normalizing space in all attribute values for testing, and assuming that reading systems are doing the same.
To go back to what I said there, having id=" pub01 "
and unique-identifier="pub01"
shouldn't create an error, nor should id="pub01"
and unique-identifier=" pub01 "
, but they do.
Similarly, we're not checking the validity of a lot of attributes if their values include whitespace, as exact matches are expected.
This sentence is misleading. XML defines whitespace, but it does not specify that whitespace in element contents be dropped.
Yes, I think the intention was XML defines the white space that must be trimmed from the beginning and ending of each value, as the action to take is clearly defined already, only what characters is ambiguous. Let's open this in the epub revision tracker and fix.
@mattgarrish if you can find the time, could you review if there's anything to fix in EPUBCheck here? (OK if you can't, let me know!)
I fixed the problem for the package schematron tests when I was updating this earlier, but looking at some of the other schematron schemas I can see the same problem. There are a lot of tests that don't account for spaces when doing equality checks on attribute values, so will potentially not be triggered when they should.
I can spend some time today cleaning these.
I can spend some time today cleaning these.
OK! Note that this may not be super critical either (we've never been reported the use case), so we could postpone it to a later patch release if it's too time consuming.
Also, this is something that can be handled at parsing time: if there's an easy way to compile the list of attributes which allow leading/trailing whitespace, we could trim them when we parse them, so that the schema only see the trimmed value.
Let me know if this latter option sounds easier, and I can quickly implement if you can compile the list of attributes.
Ya, I think I might have started by opening the worst file, too. I didn't notice it was the ocf-metadata.sch file, which was a copy of the package document tests we created for multiple renditions. Hardly worth the effort of fixing.
I think we're fine sitting on this until later. The most important file is fixed.