Provide a grammar for the URL parser
As an occasional standards user, I find the lack of a succinct expression of the grammar for valid URL strings rather frustrating. It makes it difficult to follow what's going on and, in particular, to work out whether a given thing is a valid URL. A grammar in EBNF or a similar form would be greatly appreciated and would make this spec significantly easier to understand.
Just to be sure, you're saying you prefer something like
url-query-string = url-unit{0,}
(or whatever grammar format) to the spec's current
A URL-query string must be zero or more URL units.
?
I think this is #24 / #416.
Apologies for the slow reply.
I think that @masinter is right, and that my concern matches the ones discussed there. I skimmed those threads, and a few statements concerned me, such as the assertion that a full Turing machine is required to parse URLs. This would very much surprise me: my instinct on reading the grammar is that, once you separate out the different paths for file, special schemes, and non-special schemes in relative URLs, the result is almost certainly context-free. It might even be regular. The fact that the given algorithm mostly does a single pass is a strong indication that complex parsing is not required.
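To illustrate what I mean, here is a rough sketch in ABNF-like notation (the rule names are mine, not the spec's, and the remainder rules are left as stubs), splitting at the top level by scheme class:
url-string      = special-url / file-url / non-special-url
special-url     = special-scheme ":" special-remainder        ; slashes, authority, path, ...
file-url        = "file" ":" file-remainder                    ; drive letters, host handling, ...
non-special-url = non-special-scheme ":" opaque-remainder      ; opaque path, query, fragment
special-scheme  = "http" / "https" / "ws" / "wss" / "ftp"
Each remainder rule would then, I suspect, be an ordinary context-free (likely even regular) production.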
I discussed with a colleague of mine, @tabatkins, and they said that the CSS syntax parser was much improved when it was rewritten from a state-machine-based parser into a recursive-descent parser. Doing this would effectively require writing the URL grammar out as a context-free grammar, which would make providing a BNF-like specification, even if it's only informative, very easy.
Separately, though not entirely unrelated: splitting out the parsing from the semantic functions (checking some validity rules, creating the result URL when parsing a relative URL string) would likely improve the readability of the spec and the simplicity of implementing it. I think this might be better suited for a separate thread, though, as there are some other thoughts I have in this vein as well.
This might be a more complicated problem than you think (@alercah). I have tried several times, but the scheme-dependent behaviour causes a lot of duplicate rules, so you end up with a grammar that is neither very concise nor easy to read. And there is a tricky problem with repeated slashes before the host, the handling of which depends on the base URL.
I have some notes on it here: https://github.com/alwinb/reurl/blob/master/doc/grammar.md. (I eventually went with a hybrid approach of a couple of very simple grammars and some logic rules in between.) This ties into a model of URLs that I describe here: https://github.com/alwinb/reurl/blob/master/doc/theory.md.
What's the status of this? It really does work. I developed the theory when I tried to write a library that supports relative URLs. I am quite confident that it matches the standard (though not everything is described in the notes), as the library now passes all of the parsing tests.
It's not supported by vanilla BNF, but I would personally be quite satisfied with a grammar using parameterized rules like you have there. Many modern parser generators can handle them; for those that cannot, it is relatively easy to expand out the (small number of) parameters here.
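For example (a hedged sketch only; parameterized rules like this are not vanilla ABNF, and the rule names are mine):
path(charset)    = segment(charset) *( "/" segment(charset) )
segment(charset) = *( charset / pct-encoded )
pct-encoded      = "%" HEXDIG HEXDIG
special-path     = path(special-path-char)
non-special-path = path(non-special-path-char)
A generator without parameter support can simply expand the two instantiations into separate rules.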
It would be useful to start with the BNF of RFC 3986 and make changes as necessary, at least to explain the differences and exceptions.
Interesting, I did not know that there were parser generators that support parameterised rules.
I did consider a more formal presentation with subscripted rules, but then I backed off because I thought it would be less accessible. It makes me think of higher order grammars, and I think that's too heavy. I guess in this case it could result in something quite readable too though.
As for the comparison with RFC 3986, it would be great if this can help to point out the differences. I have not looked into that much, but the good news is that it might not be that different after all. I couldn't start with the RFC though, because I was specifically aiming for the WHATWG standard. That was motivated by an assumption that this is the common URL standard, in part because it mentions obsoleting RFC 3986 and RFC 3987 as a goal.
Back to the issue, the question is how this could flow back to the WHATWG standard. And I am not really sure how that would work yet. The parser algorithm seems to be the heart of the standard, and I think there is a lot of work behind that. There is of course the section on URL Writing which does look like a grammar in prose style.
To be clear, what I tried to do, and what I suspect people in this thread (and others like it) are after, is not to give a grammar for valid URL strings (like in the URL writing section), but to give one that describes the language of URLs that is implicitly defined by the parser algorithm – and in such a way that it also describes their internal structure. Then the grammar contains all the information that you need for building a parser. This is indeed possible, but it is a large change from the standard as it is now.
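To make the contrast concrete (a rough sketch, not spec text): a grammar of the second kind would give names to the pieces the parser actually builds, roughly along the lines of the RFC 3986 shape,
url       = [ scheme ":" ] [ "//" authority ] path [ "?" query ] [ "#" fragment ]
authority = [ userinfo "@" ] host [ ":" port ]
so that each production corresponds to a field of the URL record, while the token-level rules are loosened to accept everything the parser accepts rather than only what authors are allowed to write.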
There is of course the section on URL Writing which does look like a grammar in prose style.
To be clear, what I tried to do, and what I suspect people in this thread (and alike) are after is not to give a grammar for valid URL strings (like in the URL writing section)
I think in fact that there are people who’ve been involved with discussion of this who have actually been hoping for a formal grammar for valid URL strings — in some kind of familiar formalism rather than in the prose style the spec uses. (And to be clear, I’m not personally one of the people who wants that — but from following past discussions around this, I can say I’m certain that’s what at least some people have been asking for.)
but to give one that describes the language of URLs that is implicitly defined by the parser algorithm
I know that’s what some people want but I think as pointed out in https://github.com/whatwg/url/issues/479#issuecomment-708482325 (and #24 and other places) there are some serious challenges in attempting to write such a grammar.
And, as I think has also been pointed out in #24 and elsewhere, for anybody who wants that there's nothing that prevents them from attempting to write up such a grammar themselves, based on the spec algorithms — but short of that happening, nobody else involved with the development of the spec is volunteering to try to write it up.
The technical issues are mostly solved. I'm willing to help, and I'm looking for some feedback about how to get started.
I cannot just write a separate section, because in one way or another it'll compete with the algorithm for normativity (among other things). That's a big problem, and I think it is the main reason to resist this.
It also requires specifying the parse tree and some operations on it. That could use the (internal) URL records, but it will require some changes to them.
My main concern is that these things together will trigger too much resistance, too many changes, and that then the effort will fail.
What I can do is try to sketch out some approaches that could help to prevent that. I'll need some time to figure that out. I'm not sure what else I can do to get this going at the moment. Feedback would be appreciated.
I cannot just write a separate section, because in one way or another it'll compete with the algorithm for normativity (among other things). That's a big problem, and I think it is the main reason to resist this.
Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.
For this case, a grammar could be maintained in a separate (non-WHATWG) repo, and published separately — and then the spec could possibly (non-normatively) link to it (not strictly necessary, but just to help provide awareness it exists).
Agreed with @sideshowbarker, generally. If people want to work on personal projects that provide alternative URL parser formalisms, that's great, and I'm glad we've worked on a test suite to help. As seen from this thread, some folks might appreciate some alternatives more than they appreciate the spec, and so it could be helpful to such individuals. But the spec is good as-is.
There are issues with it that cause further fragmentation right now. I have to say I'm disappointed with this response. I'm trying to help out and solve issues, not just this one but also #531 and #354 amongst others, which cannot be done without a compositional approach. If you do not address that, people come up with ad hoc solutions, creating new corner cases, leading to renewed fragmentation. You can already see this happening in some of the issues.
It is also not true that it cannot be done, because I already did it: once for my library, and a couple of weeks ago I did a fork of jsdom/whatwg-url over a weekend that uses a modular parser/resolver based on my notes, has everything in place to start supporting relative URLs as well, and passes all the tests. I didn't post about it because the changes are too large; clearly it would not work out. I'm trying to take these concerns into account and work with them. Dismissing that with 'things are fine' is, I think, a shame.
While I unfortunately do not have the time to contribute to any work on this at the moment, I have a few thoughts.
- First, I agree that care should be taken to avoid confusion about normativity. There definitely should be only one normative spec. If a grammar were to go into the spec itself alongside the algorithm, with the algorithm remaining normative, great care would need to be taken to ensure that the two remain in agreement, since disagreement between them breeds problems.
- Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out, and the test suite. I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser, and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.
- Third, I am convinced that trying to define a grammar, normative or non-normative, for the spec as it is, is fundamentally a fool's errand.
- But I am not of the opinion that this means that it shouldn't be done. I believe that the current parser should be ripped out entirely, or at least moved to an auxiliary specification on how browsers should implement an actual specification.
To elaborate a bit, I very much disagree with the claim that "the spec is good as is". The spec definitely provides an unambiguous specification with enough information to determine whether or not an implementation meets the specification. This is enough to meet the bare minimum requirements and be an adequate technical standard. But it has a number of flaws that make it difficult to use in practice:
- It conflates domains. This URL specification is primarily geared towards the web and web standards, as is indicated by a lot of the implicit assumptions it makes (see also #535). But the use of URLs, and RFC 3986, extends far beyond the web and the spec does not make any meaningful attempt to address uses outside the web. Recommendations on displaying URLs to users are explicitly applicable only to browsers. It defines an API applicable only to the web, with no discussion of API design for other environments. It canonically defines file as the default scheme when no scheme is specified, when most clients would likely prefer to make that decision themselves.
- The mere fact that the spec is a living standard makes it unsuitable for use in many application domains. It may be acceptable for the web, perhaps, but there are other interchange systems that need a more reliable mechanism.
- It contains almost no background or discussion. It contains only a section listing the goals of the document and three sparse paragraphs on security considerations. It does not explain the purpose of a URL or the human meaning of its various components. It explains almost none of its decisions, such as why special schemes are special or why particular different API setters behave the way they do, or why special schemes get a special, elevated place in the spec to have their scheme-specific parsing requirements incorporated into it.
- It is poorly organized. For instance, it discusses security considerations in sections 4.8 and 1.3 and does not mention this in section 2.
- Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm. It is incredibly opaque. There is no benefit to this. I defer to @sjamaan's excellent comment. I find the suggestion that section 4.3 provides a useful "overview" of the grammar to be ridiculous. It doesn't. It's just as opaque as the rest of the document.
- As an additional point, the opacity of the spec makes it nearly impossible to reason about whether a given behaviour is intentional or a bug. The spec is defined by the implementation in pseudocode. Even understanding the spec's behaviour given an input, much less deciding whether or not it is correct, effectively requires debugging the specification.
- There is no abstraction of related concepts, and there is bad mixing of technical layers between semantics and syntax. Semantic errors are returned during parsing, rather than during a separate step on the parsed values.
It is worth noting that this specification explicitly intends to obsolete RFC 3986. RFC 3986 is a confusing mix of normative and informative text, and a difficult specification to apply and use. Yet this specification is far from being able to obsolete it, because it is targeted entirely at one application domain.
In conclusion, this spec is a PHP Hammer. It is not "good". It is barely adequate in the one domain it chooses to support, and abysmal in any other domain.
If the direction of this standard can't reasonably be changed (assuming there are people willing to put in the effort), and in particular if the WHATWG is not interested in addressing other domains in this specification, I would be fully supportive of an effort, likely through the IETF's RFC process, to design a specification which actually does replace RFC 3986, and to have the WHATWG spec recognized only as the web standard on the implementation of that domain-agnostic URL specification. I will probably direct any energy I do find myself with to address this spec to that project rather than this one.
- Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out
To be clear, there’s nothing semi-normative about that https://url.spec.whatwg.org/#url-writing section. It’s normative.
and the test suite.
And to be clear about that: The test suite is not normative.
I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser
The section on writing URLs doesn't claim it matches the parser. Specifically: there are known URL cases that the writing-URLs section defines as non-conforming (in the sense that documents/authors are prohibited from using them) but which have normative requirements that parsers must follow if documents/authors use them anyway.
and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.
While some people may treat the test suite as authoritative for some purposes, it’s not normative. In the URL spec and other WHATWG specs, normative is a term of art used consistently with an exact unambiguous meaning: it applies only to the spec, and specifically only to the spec language that states actual requirements (e.g., using RFC 2119 must, must not, etc., wording).
The test suite doesn’t state requirements; instead it tests the normative requirements in the spec. And if the test suite were to test something which the spec doesn’t explicitly require, then the test suite would be out of conformance with the spec.
- Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm.
The algorithm doesn’t define whether a URL is valid or not; instead the algorithm defines how a URL must be processed, whether or not the https://url.spec.whatwg.org/#url-writing section defines that URL as valid/conforming.
Note also that the URL spec has multiple conformance classes for which it states normative requirements; its algorithms state one set of requirements for parsers as a conformance class, and separately, the https://url.spec.whatwg.org/#url-writing section states a different set of requirements for documents/authors as a conformance class.
I'm well aware that the test suite is not normative, and that the writing spec is normative, and of the use of "normative" as a term of art. But you said:
Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.
You claimed that people treating the extra formalism as normative is an argument against the inclusion, not that it would create two potentially-contradictory normative texts.
By the same argument, you should remove the URL writing spec, because it risks being treated as normative, and consider retiring the test suite as well because people treat it as normative because the spec itself is incomprehensible.
I don't think that you should remove either of them. I think you should make the spec comprehensible so that people stop being tempted to treat something else as normative.
The section on writing URLs doesn’t claim it matches the parser.
I agree that it does not claim to produce invalid URLs. It does, however, make a claim that the operation of serialization is reversible by the parser:
The URL serializer takes a URL and returns an ASCII string. (If that string is then parsed, the result will equal the URL that was serialized.)
Admittedly, this claim is rather suspect because it then provides many examples of where that is not true. I suspect it is missing some qualifiers, such as that the serialization must succeed and the parsing must be done with no base URL and no encoding override.
Even with those qualifiers added, I challenge you to produce a formal proof that serialization and parsing produces an equal URL.
Thank you @alercah. I feel validated by the statement that I have been running a fool's errand. It is nice that someone understands the issues and the amount of work that it involves.
The only reason I pushed through was because I had made a commitment to myself that I would finish this project.
RFC 3986 is probably what you want.
No, I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified, because the one that describes the behaviour of web browsers does so in a long stretch of convoluted pseudocode describing a monolithic function that mixes parsing with normalisation, resolution, percent encoding and updates to URL components. Indeed, an update to RFC 3986 to include browser behaviour would be really, really great. Unfortunately that requires reverse engineering this standard.
I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified
I tried over many years to resolve this issue. @sideshowbarker @rubys @royfielding @duerst can attest; see https://tools.ietf.org/html/draft-ruby-url-problem-01 from 2015.
Work has started. This is going to happen. Stay tuned.
There is a GitHub project page for a rephrased specification here. It can be viewed online here.
Whilst still incomplete, it is coming along quite nicely. The key section on Reference Resolution is complete. The formal grammars are nearly complete. There is also a reference implementation of the specification here.
It will not be hard to add a normative section on browser behaviour to e.g. RFC 3986/RFC 3987 once this is finished. The differences are primarily around the character sets and the multiple slashes before the authority. The latter is taken care of by the forced resolution described in the Reference Resolution section.
This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.
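To make the slash difference concrete, here is a rough sketch (my reading of the basic URL parser's handling of special-authority slashes, not spec text). Where RFC 3986 demands exactly two slashes before the authority,
hier-part = "//" authority path-abempty / path-absolute / path-rootless / path-empty
the parser for special schemes tolerates any run of slashes or backslashes (each deviation being only a validation error), roughly:
special-slashes = *( "/" / "\" )
special-rest    = special-slashes authority path-abempty
Forced resolution is what maps this tolerant form back onto the strict RFC shape.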
Following up on this:
This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.
I have done more research, esp. around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.
The differences will be very small, after all is said and done. Which is great!
Character Sets
IRI vs WHATWG URL
The codepoints allowed in the components of valid WHATWG URLs are almost the same as in RFC3987 IRIs. There is only one difference:
- WHATWG URLs allow more non-ASCII unicode code points in components.
Specifically, the WHATWG Standard allows the additional codepoints:
- The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
- Specials, minus the non-characters: { u+FFF0-u+FFFD }
- Tags and variation selectors, specifically, { u+E0000-u+E0FFF }.
Specials are allowed in the query part of an IRI, not in the other components though.
IRI vs loose-WHATWG URL
Let me call any input that the 'basic url parser' accepts as a single argument, a 'loose-WHATWG URL'.
Note: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in 3.2.1. that the first : separates the username from the password. So I assume this in what follows. Note though that valid WHATWG URLs do not allow username and password components at all.
To go from IRIs to loose WHATWG URLs, allow any non-ASCII unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:
iinvalid := { u+0-u+1F, space, ", <, >, [, ], ^, `, {, |, }, u+7F }
Then, for the components:
- username: add iinvalid and @ (but remove :).
- password: add iinvalid and @.
- opaque-host: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, ", `, {, }, u+7F }.
- path component: add iinvalid.
- query: add iinvalid.
- fragment: add iinvalid and #.
- For non-special loose WHATWG URLs, also add \ to all the above except for opaque-host.
The grammar would have to be modified to allow invalid percent-escape sequences: a single % followed by zero or one hex digits (but not two).
Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar.
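As a sketch of how these additions could be folded into the grammar (ABNF-like, rule names mine; RFC 3987's iquery rule shown for comparison):
iquery      = *( ipchar / iprivate / "/" / "?" )                          ; RFC 3987
loose-query = *( ipchar / iprivate / "/" / "?" / iinvalid / bad-escape )
bad-escape  = "%" [ HEXDIG ]                                              ; a lone %, possibly followed by one hex digit
with iinvalid as defined above, and similar loosened rules for the other components (plus \ for non-special URLs). Note that bad-escape overlaps with the pct-encoded alternative inside ipchar; a longest-match convention or an explicit side condition is needed to keep the two apart.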
I've suggested a BOF session at IETF 111, which will be held online, to consider what changes to IETF specs would, in conjunction with WHATWG specs, resolve this issue. A BOF is not a working group, but rather a precursor, to evaluate whether there is enough energy to start one. IETF attendance fees can be waived. https://mailarchive.ietf.org/arch/msg/dispatch/i3_t-KjapMhFPCIoQe1N47buZ5M/
In case a new IETF effort does get started,
I just want to state that I hope, and actually believe that a new/ updated IETF specification and the WHATWG URL standard could complement each other quite well. It will require work and there will be problems, but it is possible and worthwhile.
@alwinb Nothing in IETF will happen unless you show up with friends willing to actually do the work.
So here's some work to be done.
- [ ] Decide if an addendum is enough, or if RFC 3986/3987 should be merged (the latter has my preference)
- [x] Decide if the full WHATWG parsing/resolution behaviour should be included, or if it is enough to provide the elementary operations that can then be recombined in the WHATWG standard to exactly reproduce their current behaviour (latter one has my preference, then the standards can really be complementary!)
- [x] Decide how to include the loose grammar in such a document (my preference: parameterise the character sets)
- [ ] Rewrite my 'force' operation into the RFC style and maybe refactor the merge operations from RFC 3986 a little, or switch to my model of sequences more wholeheartedly.
- [ ] Amend or parameterise the 'path merge' to support the WHATWG percent-encoded dotted segments.
- [ ] A remaining technical issue: solve #574, and figure out how to incorporate that into the RFC grammar
- [ ] Decide what to do with the numbers in the IP addresses of the loose grammar, esp. how to express their allowed range (i.e. on the grammatical level, as in RFC 3986's dec-octet rule sketched below, or on a semantic level)
- [ ] Preferably, find implementations of the existing RFCs, work with them to implement the additions, and have them test against the wpt test suite, to corroborate that the additions can be combined to express the WHATWG behaviour
- [ ] Expand the wpt test suite to include validity tests (!!)
- [ ] Write about the encoding-normal form, parameterise it by component-dependent character sets, so that the percentEncodeSets of the WHATWG standard can be plugged into the comparison ladder nicely.
- [ ] For the WHATWG standard: decide if a precomposed version of the 'basic-url-parser' should be kept or if it should be split up. It may be possible to automatically generate a precomposed version from an implementation of the elementary operations, and to also automatically generate the pseudocode from that.
Let's get started!
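For the IP-address item above, RFC 3986 expresses the allowed range purely at the grammatical level with its dec-octet rule:
dec-octet = DIGIT                 ; 0-9
          / %x31-39 DIGIT         ; 10-99
          / "1" 2DIGIT            ; 100-199
          / "2" %x30-34 DIGIT     ; 200-249
          / "25" %x30-35          ; 250-255
The loose grammar could either follow that style, or (probably simpler, and closer to how the WHATWG host parser behaves, if I read it correctly) accept any run of digits grammatically and check the range, along with the decimal/octal/hex forms, as a semantic step afterwards.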
I responded on the "dispatch" list: first, what's the minimum amount of work that will address the lack of clarity about which spec is what? (MNot's suggestion.) Second, what is the minimum to resolve the differences in normative specifications? Once the specs are aligned normatively, you can do everything else. Step 0: host a BOF at IETF 111 with stakeholders. (Get people to show up and agree to do work.)
@masinter Thank you. I think that is a good strategy, but with an aside: it is dangerous to apply the IETF level of accuracy and exactness to this too soon. Rather, the work so far has been tearing apart, digesting, and recomposing/refactoring what the WHATWG has produced, and now trying to relate it to what was there before.
what is the minimum to resolve the differences in normative specifications? Once the specs are aligned normatively, you can do everything else.
I got carried away, but some of the things I mentioned do need to be done, otherwise you cannot make that comparison. Or perhaps, those items are my answer to that question.
Have you studied my reverse specification? Have you checked it against the WHATWG standard and your knowledge of the RFCs? Do you have comments or ideas? Doing so should enable you to answer this question as well.
Step 0: host a BOF at IETF 111 with stakeholders. (Get people to show up and agree to do work.)
I'm a bit intimidated, but, it sounds good.
Oh except the work thing. I was taken aback, because I've done so much work on this already, and still, and I am kind of tired. Also, I find the political situation unpleasant. I don't want to pick sides. I just want to solve the situation.