sttp icon indicating copy to clipboard operation
sttp copied to clipboard

sttp2: Uri.parse does not fail on invalid URIs

Open bmarinovic opened this issue 5 years ago • 11 comments

The safe Uri.parse function does not fail on invalid URIs.

Examples:

  • Uri.parse(" ") returns Right(http://%20)
  • Uri.parse("https:/ /uri") returns Right(http://https%3A%2F%20%2Furi)

Expected: Left("Invalid URI") or some similar message

Also, not sure if this is expected behavior, but this prefixes invalid protocol with valid protocol and also succeeds:

  • Uri.parse("ttps:/www.google.com") returns Right(http://ttps%3A%2Fwww.google.com)

bmarinovic avatar Jan 06 '20 13:01 bmarinovic

I understand why this might be surprising, but should Uri.parse("somehost") parse as http://somehost? That is, if unspecified we use the http protocol, and assume the first token is a host?

The same is happening with Uri.parse(" "). The string is encoded (to %20) and this is assumed to be the host.

adamw avatar Jan 07 '20 10:01 adamw

Shouldn't we trim the input before encoding?

ghostbuster91 avatar Jan 07 '20 12:01 ghostbuster91

That would solve " " (hopefully ;) ). Still, wouldn't solve https:/ /uri

adamw avatar Jan 07 '20 12:01 adamw

My issue here https://github.com/softwaremill/sttp-model/issues/2 seems to be pointing at the same surprising behavior.

I didn't realize that uri"abcdef" would default to http://abcdef. Because of this, unfortunately my go to way of validating is: uri"${URI.create(s).toASCIIString}"

I think that it's handy to provide a default scheme (which begs the question – why not https?), but maybe it's also good to have a stricter parser that would require the scheme to be present. This makes more sense when using Uri for decoding or validating user input. What do you think?

kusamakura avatar Jan 21 '20 08:01 kusamakura

Thinking about this issue, I'm leaning towards requiring the scheme. As you said, why should http be the default, and not https. I think it's a sane requirement, and as you write, having this implied is suprising.

adamw avatar Jan 21 '20 14:01 adamw

Of course PRs are welcome :) This would probably be a change both to the interpolator and Uri.parse, which I think simply calls the interpolator.

adamw avatar Jan 21 '20 14:01 adamw

@adamw I'd love to take on this issue if it's alright :)

kusamakura avatar Jan 21 '20 15:01 kusamakura

Sure, go ahead :)

adamw avatar Jan 21 '20 15:01 adamw

I think I got the uri" " case down by trimming the parts of the string context in the interpolator, but work for scheme validation seems a bit more involved. For instance, I expect mailto:[email protected] to work just fine. Right now, it's treated as http://mailto:[email protected]. With softwaremill/sttp-model#3, it's mailto://[email protected], which works fine (just as https:github.com works fine), but is still not ideal – there's no authority for mailto, and the email address is just the path.

I found out that the // part is optional, and denotes the optional authority if it's present. However, some URI schemes don't require an authority, such as mailto. For the interpolator to know which ones require that, we'll have to add a list of common schemes. Considering the context in which this library is used, I think it's an okay compromise to support the usual cases (http, https, ftp, sftp, etc.), but that may also not be good enough. What do you think, @adamw?

kusamakura avatar Jan 24 '20 07:01 kusamakura

Sure, makes sense. Would be great if we could also make it work well for ftp/mailto etc. I suppose by default the authority would be optional, with special-casing for http, ftp which would require it?

Btw.: in the mailto:[email protected] case, I suppose the uri consists of the scheme and a single path component, equal to [email protected]?

adamw avatar Jan 24 '20 14:01 adamw

Sorry for the late reply. Yeah, I was thinking that the authority should be optional by default, and then we can have a list of schemes that require it.

And yes, for mailto, it's exactly that: scheme + path. That's usually the case for custom URIs as well, such as isbn or news.

So the work I think will be in the tokenizer, where we start with scheme, and then branch off the tokenizers depending on whether the scheme requires an authority or not. Those that require one will branch into the authority/userinfo tokenizer, while those that don't will branch into the path tokenizer.

kusamakura avatar Jan 27 '20 08:01 kusamakura

As first case handling is implemented in https://github.com/softwaremill/sttp-model/pull/194 and meanwhile we added relative URL handling - second case is valid thought contains unnecessary empty space. If we start adding such corner-cases we'll end up with 1000 if-s which is not the best option in the long term.

Pask423 avatar Sep 15 '22 17:09 Pask423