http
http copied to clipboard
URI parser cannot parse URNs
It seems the URI parser stumbles with URNs. A simple test to reproduce:
fn main() {
use http::uri::Uri;
let uri = "urn:isbn:1238492".parse::<Uri>();
println!("{:?}", uri);
}
Produces the output:
$ ./testcase
Err(InvalidUri(InvalidAuthority))
The authority is only present when the scheme is followed by //, otherwise the parser should be parsing everything after the first : as the path. Python gets this right, and Go parses this as Opaque (instead of Host, but still successfully parses):
Python:
>>> urlparse('urn:isbn:0451450523')
ParseResult(scheme='urn', netloc='', path='isbn:0451450523', params='', query='', fragment='')
Go:
>>> url, _ := url.Parse("urn:isbn:0451450523")
>>> fmt.Println("Host (Opaque):", url.Opaque)
Host (Opaque): isbn:0451450523
The main challenge in implementing this is that the Uri parser allows inputs with no scheme. For example, the parser parses localhost:3000 as having no scheme and an authority of localhost:3000. This means that some inputs are ambiguous. As a hypothetical example, for the input custom:3000, it is unclear if the scheme is custom and the path is 3000, or if there is no scheme and the authority is custom:3000.
The quickest fix would be to special-case the urn scheme (http and https are already special-cased). Unfortunately this would not work with custom schemes such as in the previous example.
Is there a use case that would require supporting URNs?
So URNs are actually designed to be valid URIs so if the parser respected RFC 3986 they should already automatically be parseable and not need any extra support. The issue here is the current URI parser implementation isn't correct.
In your examples both localhost and custom would unambiguously be the scheme and for a URI with no scheme you would have to enter //localhost:3000 (see here).
Other languages do get this right:
Python:
>>> urlparse('localhost:8000')
ParseResult(scheme='localhost', netloc='', path='3000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')
Go:
>>> url, _ := url.Parse("custom:3000")
Opaque: 3000 , Host: , Scheme: custom , Path:
>>> url, _ := url.Parse("//custom:3000")
Opaque: , Host: custom:3000 , Scheme: , Path:
My personal use case is that I receive URIs from a third party of which URNs are mixed into the incoming data. I reached for a URI implementation expecting it to work and in the end switched to different language to write the tool.
Strangely enough, I get a different result for Python (2.7.17rc1 and 3.7.5rc1):
>>> urlparse('localhost:8000')
ParseResult(scheme='', netloc='', path='localhost:8000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')
While I agree with you regarding RFC 3986 conformance, the current parser is already non-conformant and does something similar to the Python implementation. So changing the behavior now might have undesirable effects on downstream hyper clients.
So after doing a bit of digging it seems that Python had special handling for ports which caused it to break when paths are numeric. It's also doing the wrong thing there but they did in fact fix this. Very recently it seems:
https://github.com/python/cpython/pull/16839 https://github.com/python/cpython/pull/661
It's likely not being backported to older releases for exactly the reason you mention, as peoples code would suddenly start breaking on only a minor release bump.
I think this should still be fixed and simply be included in a major release. Otherwise this library essentially empowers the Rust ecosystem to do the wrong thing. If Python can fix this after 20 years, it's definitely not too late here 🙂
I'm happy to try and make the changes myself if the change is considered acceptable by the hyper maintainers.