
Normalization of path segments should probably happen before normalization of percent escaping

Open sporkmonger opened this issue 14 years ago • 7 comments

Addressable::URI.parse("/%2E/").normalize.to_str.should == "/%2E/"

sporkmonger avatar Mar 20 '10 06:03 sporkmonger

This issue probably requires a check-in with the IETF URI mailing list before deciding one way or the other.

sporkmonger avatar Mar 20 '10 06:03 sporkmonger

I understand that this was opened a long time ago, but I wanted to check in and see what's up with this issue. We've hit this bug in a slightly different context and are not sure how to deal with it. Any chance this is going to be fixed?

kovyrin avatar Oct 18 '13 01:10 kovyrin

Could you elaborate on the issue you're hitting? A test case would be awesome.

sporkmonger avatar Oct 18 '13 14:10 sporkmonger

Actually, now I'm not sure if our issue is related to this one. Here is our problem:

irb(main):001:0> Addressable::URI.parse(PostRank::URI.unescape("http://foo.com/blah%ef%bc%9f"))
=> #<Addressable::URI:0x5648890 URI:http://foo.com/blah？>
irb(main):002:0> Addressable::URI.parse(PostRank::URI.unescape("http://foo.com/blah%ef%bc%9f")).normalize!
=> #<Addressable::URI:0x564ed08 URI:http://foo.com/blah%3F>

The normalize call screws up a perfectly valid (AFAIU) Unicode character, the fullwidth question mark (U+FF1F, UTF-8 bytes EF BC 9F), and replaces it with an ASCII question mark (%3F).

kovyrin avatar Oct 18 '13 16:10 kovyrin

It's doing the right thing actually. IRIs (Unicode-friendly URIs) use Unicode normalization form KC to limit phishing. NFKC tends to do perceptual codepoint conversions, like converting '？' (fullwidth question mark, U+FF1F) to '?' (U+003F). The solution here is not to normalize the URI if this is causing a problem, or to instead normalize components piecemeal. "http://foo.com/blah%ef%bc%9f" and "http://foo.com/blah%3F" are considered equivalent.
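
To see the NFKC mapping in isolation, here is a minimal sketch using only Ruby's standard `String#unicode_normalize`. It is not Addressable's internal code, and the helper names `percent_encode_bytes` and `nfkc` are made up for this illustration.

```ruby
# Illustration only: plain Ruby stdlib, not Addressable's internals.

# Percent-encode every byte of a string (hypothetical helper, for display).
def percent_encode_bytes(str)
  str.bytes.map { |b| format("%%%02X", b) }.join
end

# Apply Unicode compatibility normalization (NFKC) to a string.
def nfkc(str)
  str.unicode_normalize(:nfkc)
end

fullwidth = "\uFF1F" # FULLWIDTH QUESTION MARK

puts percent_encode_bytes(fullwidth)       # => %EF%BC%9F
puts percent_encode_bytes(nfkc(fullwidth)) # => %3F
```

The second line of output is the ASCII question mark, which is why the normalized path ends in %3F rather than a literal "?" (a literal "?" would start the query component).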

sporkmonger avatar Oct 20 '13 14:10 sporkmonger

Some more context: %2E is the percent-encoded form of "." (period).

irb(main):038:0> CGI.unescapeURIComponent "%2E"
=> "."

> Addressable::URI.parse("/%2E/").normalize.to_str.should == "/%2E/"

Not sure why this should be true? If you want to compare URIs, shouldn't you normalize both before comparing?


Hmm, from https://www.rfc-editor.org/rfc/rfc3986#section-2.3

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
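
As a rough sketch of the rule quoted above, the following decodes only those percent-encoded octets that map to unreserved characters and leaves every other escape untouched. `decode_unreserved` is a hypothetical name for this illustration, not a method from Addressable.

```ruby
# Sketch of the RFC 3986 section 2.3 rule; decode_unreserved is a
# hypothetical helper, not part of Addressable.
def decode_unreserved(str)
  str.gsub(/%([0-9A-Fa-f]{2})/) do |escape|
    char = $1.hex.chr
    # Decode only ALPHA / DIGIT / "-" / "." / "_" / "~"; keep other escapes.
    char.match?(/\A[A-Za-z0-9\-._~]\z/) ? char : escape
  end
end

puts decode_unreserved("/%2E/")       # => /./
puts decode_unreserved("/blah%3F")    # => /blah%3F
puts decode_unreserved("%7Euser%2Fx") # => ~user%2Fx
```

Note that "%3F" and "%2F" stay encoded because "?" and "/" are reserved characters.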

Does this mean that Addressable::URI.parse("/%2E/") should be turned into Addressable::URI.parse("/./") directly at #parse?

Normalization removes the dot segment and its trailing slash:

irb(main):042:0> Addressable::URI.parse("/%2E/").normalize.to_s
=> "/"
irb(main):044:0> Addressable::URI.parse("/./").normalize.to_s
=> "/"
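
To illustrate why the order of the two steps matters, here is a sketch of the remove_dot_segments algorithm from RFC 3986 section 5.2.4, combined with a deliberately tiny decoder that only handles %2E. All method names are hypothetical and this is not Addressable's implementation.

```ruby
# Sketch of RFC 3986 section 5.2.4 (remove_dot_segments); not Addressable's code.
def remove_dot_segments(path)
  input = path.dup
  output = []
  until input.empty?
    if input.start_with?("../")
      input = input[3..-1]
    elsif input.start_with?("./")
      input = input[2..-1]
    elsif input.start_with?("/./")
      input = input[2..-1]            # "/./" becomes "/"
    elsif input == "/."
      input = "/"
    elsif input.start_with?("/../")
      input = input[3..-1]            # "/../" becomes "/"
      output.pop                      # drop the previous segment
    elsif input == "/.."
      input = "/"
      output.pop
    elsif input == "." || input == ".."
      input = ""
    else
      # Move the first segment (with its leading "/", if any) to the output.
      segment = input[%r{\A/?[^/]*}]
      output << segment
      input = input[segment.length..-1]
    end
  end
  output.join
end

# Assumption for brevity: only %2E is decoded here.
def decode_period(path)
  path.gsub(/%2E/i, ".")
end

# Percent-decoding first, then dot-segment removal.
def dots_after_decoding(path)
  remove_dot_segments(decode_period(path))
end

# Dot-segment removal first, then percent-decoding.
def dots_before_decoding(path)
  decode_period(remove_dot_segments(path))
end

puts dots_after_decoding("/%2E/")  # => /
puts dots_before_decoding("/%2E/") # => /./
```

With decoding first, "/%2E/" collapses to "/", matching the irb output above. With dot-segment removal first, the encoded period survives that step and the result is "/./", so the two orderings are not equivalent.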

dentarg avatar Jul 19 '23 08:07 dentarg

> Does this mean that Addressable::URI.parse("/%2E/") should be turned into Addressable::URI.parse("/./") directly at #parse?

That would go against what's suggested in https://github.com/sporkmonger/addressable/issues/477

dentarg avatar Jul 19 '23 08:07 dentarg