quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

well-formedness

Open faassen opened this issue 10 months ago • 6 comments

My understanding is that quick-xml supports reading XML but does not check for all well-formedness errors. It reports a whole bunch of them, but not everything. Since I'm interested in well-formedness I'm curious whether we could have a reader layered over the existing ones that does validate for well-formedness.

I'd like to make an inventory of what's missing:

  • while IllFormedError::MissingEndTag exists it is not actually produced by the reader normally, only if read_to_end is explicitly called.
  • putting illegal stuff on the top level such as multiple elements, text nodes, etc. Note that when dealing with XML fragments (as implied here, "fragment" has multiple meanings) it's possible to have multiple elements and text nodes on the top.
  • having a declaration without content is currently accepted

What other aspects of well-formedness did I miss that quick-xml currently does not check for? There are a whole cluster of them around entities, but I'm okay with ignoring DTDs entirely.

To implement checking whether tags are balanced, some kind of stack of which tags have been opened needs to be maintained. When considering how to do this efficiently I noticed that the internal ReaderState appears to maintain an efficient structure to track which elements have been started but not ended yet. But I don't think that's exposed to the outside world, is it? Could this indeed be useful for this?

Am I correct that quick-xml is pretty close in providing all the pieces already?

faassen avatar Feb 27 '25 16:02 faassen

I have a very WIP branch validation with of my vision for well-formedness checking. The main idea to introduce validation iterator for each event which will return errors and you may process them as you want.

Several observations:

  • Well-formedness restrictions are mainly related to DOCTYPE definitions and because we do not parse DOCTYPE (yet) they cannot be enforced
  • Unique attributes already may be enforced, but I plan to make it more explicit, by calling a method, so you may do not check it if you do not want
  • Check for matching open and close tag names already exists, but I also want to make it more explicit. The difference of this check is that all other checks is a check of event level, but this is a check of a reader level

If you are interested in helping in this direction, then I will be happy to.

Mingun avatar Mar 09 '25 09:03 Mingun

Very cool!

May I suggest though that using the word "validation" in the context of well-formedness is opening things up for confusion? The XML specification makes a very clear distinction between valid XML and well-formed XML.

faassen avatar Mar 09 '25 18:03 faassen

I'm open for suggestions.

Mingun avatar Mar 09 '25 20:03 Mingun

"WellFormed" is the obvious name, while a mouthful it's clear in what it does.

It's true entity declarations and resolution is out of scope. So well-formedness for what remains.

Concerning attributes, there's already a "with_checks" in the iterator which is true by default, but a general configuration option might be nicer to control this, is this what you're talking about?

check_end_names may be a different internal check but is already configurable nicely through Config at least.

faassen avatar Mar 10 '25 11:03 faassen

My proposed API would look like:

match reader.read_event()?.validated()? {
  ...
}

You will able to run checks only if you need them. If you are sure that XML is correct (it is from trusted source), you may skip them. Do you think .well_formed() would better? I'm not sure, that this method will check only for well-formedness errors, but I did not dig so deeply yet.

Mingun avatar Mar 10 '25 13:03 Mingun

quickxml is likely never going to be supporting a validating processor as defined by the XML core spec. https://www.w3.org/TR/REC-xml/#proc-types which involves DTD validation. To avoid any confusion I'd like to stay away from the word "valid" as much as possible. So well_formed() would be better from my perspective.

When thinking about it, I imagined a separate reader a bit like the namespace reader that just automatically rejects well-formedness errors. Though on the other hand that would require two versions - one for the non-namespace reader and one for the namespace one, which is not ideal.

What also seems to fit in the current approach is config options for well-formedness checks, along with one that turns on all the checks available. This is already how it works for some well-formedness checks after all.

Are you concerned that check whether this is enabled would increase the overhead for runs where this check is not required? I imagine you could make even the conditional check overhead go away with generics that just insert a "do-nothing" implementation of an internal well-formedness checker trait if no checking is required, but that risks making the code too complicated.

faassen avatar Mar 10 '25 14:03 faassen