http icon indicating copy to clipboard operation
http copied to clipboard

URI parser cannot parse URNs

Open Reisen opened this issue 4 years ago • 4 comments

It seems the URI parser stumbles with URNs. A simple test to reproduce:

fn main() {
    use http::uri::Uri;
    let uri = "urn:isbn:1238492".parse::<Uri>();
    println!("{:?}", uri);
}

Produces the output:

$ ./testcase
Err(InvalidUri(InvalidAuthority))

The authority is only present when the scheme is followed by //, otherwise the parser should be parsing everything after the first : as the path. Python gets this right, and Go parses this as Opaque (instead of Host, but still successfully parses):

Python:

>>> urlparse('urn:isbn:0451450523')
ParseResult(scheme='urn', netloc='', path='isbn:0451450523', params='', query='', fragment='')

Go:

>>> url, _ := url.Parse("urn:isbn:0451450523")
>>> fmt.Println("Host (Opaque):", url.Opaque)
Host (Opaque): isbn:0451450523

Reisen avatar Jan 02 '20 17:01 Reisen

The main challenge in implementing this is that the Uri parser allows inputs with no scheme. For example, the parser parses localhost:3000 as having no scheme and an authority of localhost:3000. This means that some inputs are ambiguous. As a hypothetical example, for the input custom:3000, it is unclear if the scheme is custom and the path is 3000, or if there is no scheme and the authority is custom:3000.

The quickest fix would be to special-case the urn scheme (http and https are already special-cased). Unfortunately this would not work with custom schemes such as in the previous example.

Is there a use case that would require supporting URNs?

yifanmai avatar Feb 01 '20 23:02 yifanmai

So URNs are actually designed to be valid URIs so if the parser respected RFC 3986 they should already automatically be parseable and not need any extra support. The issue here is the current URI parser implementation isn't correct.

In your examples both localhost and custom would unambiguously be the scheme and for a URI with no scheme you would have to enter //localhost:3000 (see here).

Other languages do get this right:

Python:

>>> urlparse('localhost:8000')
ParseResult(scheme='localhost', netloc='', path='3000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')

Go:

>>> url, _ := url.Parse("custom:3000")
Opaque: 3000 , Host:  , Scheme: custom , Path: 
>>> url, _ := url.Parse("//custom:3000")
Opaque:  , Host: custom:3000 , Scheme:  , Path: 

My personal use case is that I receive URIs from a third party of which URNs are mixed into the incoming data. I reached for a URI implementation expecting it to work and in the end switched to different language to write the tool.

Reisen avatar Feb 03 '20 05:02 Reisen

Strangely enough, I get a different result for Python (2.7.17rc1 and 3.7.5rc1):

>>> urlparse('localhost:8000')
ParseResult(scheme='', netloc='', path='localhost:8000', params='', query='', fragment='')
>>> urlparse('//custom:3000')
ParseResult(scheme='', netloc='custom:3000', path='', params='', query='', fragment='')

While I agree with you regarding RFC 3986 conformance, the current parser is already non-conformant and does something similar to the Python implementation. So changing the behavior now might have undesirable effects on downstream hyper clients.

yifanmai avatar Feb 06 '20 03:02 yifanmai

So after doing a bit of digging it seems that Python had special handling for ports which caused it to break when paths are numeric. It's also doing the wrong thing there but they did in fact fix this. Very recently it seems:

https://github.com/python/cpython/pull/16839 https://github.com/python/cpython/pull/661

It's likely not being backported to older releases for exactly the reason you mention, as peoples code would suddenly start breaking on only a minor release bump.

I think this should still be fixed and simply be included in a major release. Otherwise this library essentially empowers the Rust ecosystem to do the wrong thing. If Python can fix this after 20 years, it's definitely not too late here 🙂

I'm happy to try and make the changes myself if the change is considered acceptable by the hyper maintainers.

Reisen avatar Feb 11 '20 16:02 Reisen