html5ever
Add XML fragment parsing to xml5ever
Since I deleted the xml5ever repository, I've decided to move this issue here. This is a feature @nox asked for, and it seems to be a pain point in Servo as well; see servo/servo#11995.
The algorithm for parsing XML fragments is in the HTML5 spec.
Basically, what's needed is to construct a context element from a given name, a NamespaceMap, and an input string.
It's advised to use parse_fragment from h5e as a reference point.
(Digression)
(Digression)
I'll fix any breakage if possible. The only issue I see is the xml5ever Travis badge, unless there are some invisible Git links.
(Digression)
I wasn't aware there were any links to it.
That’s how the web works. Everything has a URL, and someone doesn’t need to tell you when they start using that URL somewhere. This is why cool URLs don’t change.
(Digression)
Sorry for going off-topic. Back to the actual issue, is an actual NamespaceMap really required or only a resolved (prefix, ns_url, local) name?
(Digression)
If I understood spec correctly when it says:
declaring all the namespace prefixes that are in scope on that element in the DOM, as well as declaring the default namespace (if any) that is in scope on that element in the DOM.
That means a function parsing an XML fragment needs pairs of (prefix, ns_url) as input, plus a way to denote the default namespace, so that it can create documents with all QualName fields assigned.
To answer the question, I don't think it's necessary to pass a NamespaceMap. We just need the (prefix, ns_url) tuples; some sort of common interface like IntoIter could work.
Ah, I see. The part I was missing is that the "starting point" of parse_fragment also includes all of the context’s namespace declarations, not just those for the "context element".
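To make the point above concrete, here is a minimal sketch of how the in-scope (prefix, ns_url) pairs plus a default namespace are enough to assign every field of a resolved name. `ResolvedName` and `InScope` are hypothetical stand-ins written for this illustration, not markup5ever's QualName or xml5ever's NamespaceMap, and the name splitting is deliberately simplified (no validation of empty prefixes or local names).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a fully resolved name (NOT markup5ever's QualName).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ResolvedName {
    pub prefix: Option<String>,
    pub ns_url: Option<String>,
    pub local: String,
}

/// Hypothetical container for the namespace declarations that are in scope
/// on the context element.
pub struct InScope {
    default_ns: Option<String>,
    bindings: HashMap<String, String>,
}

impl InScope {
    /// Build the scope from (prefix, ns_url) pairs plus an optional default namespace.
    pub fn new<I>(bindings: I, default_ns: Option<&str>) -> Self
    where
        I: IntoIterator<Item = (String, String)>,
    {
        InScope {
            default_ns: default_ns.map(String::from),
            bindings: bindings.into_iter().collect(),
        }
    }

    /// Resolve an element name such as "svg:rect" or "rect".
    /// Unprefixed element names take the default namespace (if any);
    /// a prefix with no in-scope binding yields `None`.
    pub fn resolve_element(&self, qname: &str) -> Option<ResolvedName> {
        match qname.split_once(':') {
            Some((prefix, local)) => {
                let ns_url = self.bindings.get(prefix)?;
                Some(ResolvedName {
                    prefix: Some(prefix.to_string()),
                    ns_url: Some(ns_url.clone()),
                    local: local.to_string(),
                })
            }
            None => Some(ResolvedName {
                prefix: None,
                ns_url: self.default_ns.clone(),
                local: qname.to_string(),
            }),
        }
    }
}
```

With an `svg` binding and an XHTML default namespace in scope, `resolve_element("svg:rect")` resolves against the binding, `resolve_element("p")` picks up the default namespace, and a name with an undeclared prefix is rejected.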
I think the question that remains is parse_fragment's definition:
- With NamespaceMap - I think this might be revealing too much internal implementation
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
context_name: StrTendril, context_namespace: NamespaceMap)
-> Parser<Sink>
- With IntoIter<(prefix, ns_url)> with default namespace
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
context_name: IntoIter<(Prefix, Namespace)>,
default_namespace: Option<Namespace>)
-> Parser<Sink>
- With IntoIter<(Option<prefix>, ns_url)>, with the default namespace baked into the IntoIter (although that leaves the nasty question of what to do when there are multiple (None, ns_url) pairs)
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
context_name: IntoIter<(Option<prefix>, ns_url)>)
-> Parser<Sink>
- Or some other option.
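For the third option, the open question is what to do with multiple (None, ns_url) pairs. One possible policy, sketched below, is to reject the input outright. The function `split_declarations` and the error type `DuplicateDefaultNamespace` are hypothetical names for this illustration, plain `String` stands in for the Prefix and Namespace types, and rejecting duplicates is only an assumption; "last one wins" would be an equally valid choice.

```rust
/// Hypothetical error for input that declares more than one default namespace.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateDefaultNamespace;

/// Split `(Option<prefix>, ns_url)` pairs into the default namespace and the
/// prefixed bindings. Assumed policy: a second `(None, ns_url)` pair is an error.
pub fn split_declarations<I>(
    decls: I,
) -> Result<(Option<String>, Vec<(String, String)>), DuplicateDefaultNamespace>
where
    I: IntoIterator<Item = (Option<String>, String)>,
{
    let mut default_ns: Option<String> = None;
    let mut prefixed: Vec<(String, String)> = Vec::new();

    for (prefix, ns_url) in decls {
        match prefix {
            // A prefixed declaration is kept as-is, in input order.
            Some(p) => prefixed.push((p, ns_url)),
            // An unprefixed declaration sets the default namespace, once.
            None => {
                if default_ns.is_some() {
                    return Err(DuplicateDefaultNamespace);
                }
                default_ns = Some(ns_url);
            }
        }
    }

    Ok((default_ns, prefixed))
}
```

Normalizing the input this way would reduce the third signature to the second one internally, so the choice between them becomes mostly a question of caller ergonomics.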
In selectors, users provide a generic thing with lookup methods:
https://github.com/servo/servo/blob/master/components/selectors/parser.rs#L106-L113
fn default_namespace(&self) -> Option<<Self::Impl as SelectorImpl>::NamespaceUrl> {
None
}
fn namespace_for_prefix(&self, _prefix: &<Self::Impl as SelectorImpl>::NamespacePrefix)
-> Option<<Self::Impl as SelectorImpl>::NamespaceUrl> {
None
}
(NamespaceUrl and NamespacePrefix are also generic string types in selectors, but for xml5ever they can be concrete markup5ever::Namespace and markup5ever::Prefix.)
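A rough sketch of what that lookup-based approach could look like if adapted for fragment parsing follows. The trait `NamespaceResolver`, the two implementors, and the helper `resolve` are hypothetical names invented here; nothing like them is confirmed to exist in xml5ever, and plain `String`/`&str` stand in for markup5ever::Namespace and markup5ever::Prefix to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical lookup interface modeled on the two selectors methods above.
/// Both methods default to "nothing declared", as in selectors.
pub trait NamespaceResolver {
    fn default_namespace(&self) -> Option<String> {
        None
    }
    fn namespace_for_prefix(&self, _prefix: &str) -> Option<String> {
        None
    }
}

/// A context with no namespace declarations: the defaults are enough.
pub struct NoNamespaces;
impl NamespaceResolver for NoNamespaces {}

/// A context backed by a simple map (illustrative only).
pub struct MapResolver {
    pub default_ns: Option<String>,
    pub prefixes: HashMap<String, String>,
}

impl NamespaceResolver for MapResolver {
    fn default_namespace(&self) -> Option<String> {
        self.default_ns.clone()
    }
    fn namespace_for_prefix(&self, prefix: &str) -> Option<String> {
        self.prefixes.get(prefix).cloned()
    }
}

/// How a parser might consult the resolver: a prefixed name goes through
/// `namespace_for_prefix`, an unprefixed one falls back to the default.
pub fn resolve<R: NamespaceResolver>(resolver: &R, prefix: Option<&str>) -> Option<String> {
    match prefix {
        Some(p) => resolver.namespace_for_prefix(p),
        None => resolver.default_namespace(),
    }
}
```

The appeal of this design is that the caller (for example a DOM implementation) can answer lookups straight from its own tree instead of first materializing a list of (prefix, ns_url) pairs.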
It's advised to use parse_fragment from h5e as a reference point.
Maybe not. Servo does not use parse_fragment (or anything in the driver module), it uses TreeBuilder::new_for_fragment and tokenizer_state_for_context_elem directly.
I don’t know if there is a use for fragment parsing outside of a browser engine. To the extent that Servo is the only user, there is no need to spend too much time working on a nice API that minimizes the amount of user code.
I’m considering removing html5ever::driver::parse_fragment*, which I believe nobody uses. @nox what do you think?
Is there any progress or plan?
From the above discussion, I understand that only an XmlTreeBuilder::new_for_fragment needs to be implemented, right?
Maybe I will work on this, in order to implement parse_xml_fragment in Servo. But I'm not sure how complicated it is; I need to get familiar with xml5ever first.
Dunno. I had some plans but they have fallen by the wayside due to RL issues and other projects.
You can have a crack at it. It's not super complicated, but also not super newbie-friendly.