rfc3986
rfc3986 copied to clipboard
Incorrect result returned by is_valid_uri()
According to RFC 3986, there are some ABNF rules:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / ".")
And I tried:
>>> from rfc3986 import is_valid_uri
>>> is_valid_uri("1")
True
>>> is_valid_uri("//")
True
>>> is_valid_uri("http#:/")
True
I think these results are not correct.
Really this method should be is_valid_uri_reference(), but each of these are valid URI references.
# A no-scheme path with zero extra segments
>>> rfc3986.uri_reference('1')
URIReference(scheme=None, authority=None, path='1', query=None, fragment=None)
# A relative-ref with an empty path and no scheme.
>>> rfc3986.uri_reference('//')
URIReference(scheme=None, authority=None, path=None, query=None, fragment=None)
# A relative-ref with a non-empty path and a fragment.
>>> rfc3986.uri_reference('http#:/')
URIReference(scheme=None, authority=None, path='http', query=None, fragment=':/')
Right. There's a far better and more comprehensive way of doing validations
https://rfc3986.readthedocs.io/en/latest/user/validating.html
rfc3986.is_valid_uri("Hi Super Nintendo Chalmers") # -> True
rfc3986.is_valid_uri("My cat's breath smells like cat food.") # -> True
rfc3986.is_valid_uri("I'm learnding!") # -> True
Something is deeply wrong here.
@bugeats All of those examples pass the validation check because URIReference.from_string() does normalization which converts characters within components that can be percent-encoded into their percent-encoded counterparts before calling is_valid() (Meaning all of that text ends up being a percent encoded path). And because there aren't any kwargs to "require" any component in particular we get a result of True.
Using the Validator interface makes this all more clear.
You can achieve what I believe you want via:
def is_valid_uri_with_scheme_and_host(uri):
try:
uri = rfc3986.URIReference.from_string(uri)
rfc3986.validators.Validator().require_presence_of("scheme", "host").check_validity_of("scheme", "host").validate(uri)
return True
except MissingComponentError, InvalidComponentsError:
return False
Will return False for all your examples but True for https://www.google.com/ and the like.
Yes I eventually figured that out, thanks for the detailed response. I mostly wanted to point out the absurdity of the code as it reads. I humbly suggest that you reconsider the name of that method.
Or just remove it, it's been deprecated for a few releases.
I want to use this library to detect foo bar (with a space) as an invalid URI with regard to RFC 3986. All parts of the API of this library that I have seen so far auto-correct foo bar to foo%20bar for me. How can I disable that behavior and get an exception, instead? What class and functions can I use for strict parsing behavior if is_valid_uri and .is_valid are not strict? Thanks!
Anyone?
Almost a year late to the party here.
If you don't mind using something home-grown, you could try using it. I went through RFC3986 and converted all the ABNF to regexes. I left the ABNF in comments above the regexes so you can verify the code for yourself.