html5ever icon indicating copy to clipboard operation
html5ever copied to clipboard

Add XML fragment parsing to xml5ever

Open Ygg01 opened this issue 8 years ago • 12 comments

Since I deleted the xml5ever repostiory, I've decided to move this issue here. This was a feature @nox asked, and it seems that it's a pain point in servo as well see servo/servo#11995.

The algorithm for parsing xml fragment is in HTML5 spec.

Basically, what's needed is to construct a context element with given name, a NamespaceMap, and input string. It's advised to use parse_fragment from h5e as reference point.

Ygg01 avatar May 05 '17 15:05 Ygg01

(Digression)

What? Why delete the repository? A number of existing links are gonna be broken :/

SimonSapin avatar May 05 '17 15:05 SimonSapin

(Digression)

I was doing some spring cleaning and removing repositories I didn't need. I wasn't aware there were _any_ links to it.

I'll fix any breakage if possible. The only issue I see is xml5ever Travis badge. Unless there are some invisible Git links.

Ygg01 avatar May 05 '17 16:05 Ygg01

(Digression)

The code and git history lives on, but a *github* repository also houses a pull request and issue tracker. These contain discussion of why things were done they way they were. For example https://github.com/servo/html5ever/issues/266 links to https://github.com/Ygg01/xml5ever/issues/30. These discussions can still be relevant later.

I wasn't aware there were any links to it.

That’s how the web works. Everything has an URL, and someone doesn’t need to tell you when they start using that URL somewhere. This is why cool URLs don’t change.

SimonSapin avatar May 05 '17 18:05 SimonSapin

(Digression)

What can I say - Mea culpa. I'll keep that in mind in future.

Ygg01 avatar May 08 '17 10:05 Ygg01

Sorry for going off-topic. Back to the actual issue, is an actual NamespaceMap really required or only a resolved (prefix, ns_url, local) name?

SimonSapin avatar May 11 '17 17:05 SimonSapin

(Digression)

No worries, you were right about it. It was a foolish decision on my part.
-------

If I understood spec correctly when it says:

declaring all the namespace prefixes that are in scope on that element in the DOM, as well as declaring the default namespace (if any) that is in scope on that element in the DOM.

That means function parsing XML fragment needs pairs of (prefix, ns_url) as input, plus a way to denote default namespace, so it could create documents with all QualName fields assigned.

To answer the question, I don't think it's necessary to send a NamespaceMap, we just need the tuples (prefix, ns_url), possibly using some sort of common interface like IntoIter could work.

Ygg01 avatar May 11 '17 23:05 Ygg01

Ah, I see. The part I was missing is that the "starting point" of parse_fragment also includes all of the context’s namespace declarations, not just the for the "context element".

SimonSapin avatar May 12 '17 08:05 SimonSapin

I think the question that remains is parse_fragments definition:

  • With NamespaceMap - I think this might be revealing too much internal implementation
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
                        context_name: StrTendril, context_namespace: NamespaceMap)
-> Parser<Sink>
  • With IntoIter<(prefix, ns_url)> with default namespace
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
                        context_name: IntoIter<(Prefix, Namespace)>, 
                        default_namespace: Option<Namespace>)
-> Parser<Sink>
  • With IntoIter<(Option<prefix>, ns_url)> with default namespace baked in IntoIter (although that leaves the nasty option of what to do when multiple (None, ns_url) pairs)
fn parse_fragment<Sink>(mut sink: Sink, opts: ParseOpts,
                        context_name: IntoIter<(Option<prefix>, ns_url)>)
-> Parser<Sink>
  • Or some other option.

Ygg01 avatar May 12 '17 13:05 Ygg01

In selectors, users provide a generic thing with lookup methods:

https://github.com/servo/servo/blob/master/components/selectors/parser.rs#L106-L113

    fn default_namespace(&self) -> Option<<Self::Impl as SelectorImpl>::NamespaceUrl> {
        None
    }

    fn namespace_for_prefix(&self, _prefix: &<Self::Impl as SelectorImpl>::NamespacePrefix)
                            -> Option<<Self::Impl as SelectorImpl>::NamespaceUrl> {
        None
    }

(NamespaceUrl and NamespacePrefix are also generic string types in selectors, but for xml5ever they can be concrete markup5ever::Namespace and markup5ever::Prefix.)

SimonSapin avatar May 13 '17 00:05 SimonSapin

It's advised to use parse_fragment from h5e as reference point.

Maybe not. Servo does not use parse_fragment (or anything in the driver module), it uses TreeBuilder::new_for_fragment and tokenizer_state_for_context_elem directly.

I don’t know if there is a use for fragment parsing outside of a browser engine. To the extend that Servo is the only user, there is no need to spend too much time working on a nice API that minimize the amount of user code.

I’m considering removing html5ever::driver::parse_fragment*, which I believe nobody use. @nox what do you think?

SimonSapin avatar May 21 '17 19:05 SimonSapin

Is there any progress or plan?

From the above discussion, I understand that only a XmlTreeBuilder::new_for_fragment needs to be implemented, right?

Maybe I will work on this, in order to implement parse_xml_fragment in Servo. But I'm not sure how complicated it is, I need to get familiar with xml5ever.

s6nqou avatar Apr 12 '22 09:04 s6nqou

Dunno. I had some plans but they have fallen down to the wayside due to RL issues and other projects.

You can have a crack at it. It's not super complicated, but also not super newbie-friendly.

Ygg01 avatar Apr 13 '22 20:04 Ygg01