Bug with normalization/unencode and leave_encoded

Open DreadPirateShawn opened this issue 9 years ago • 3 comments

Normalization breaks superscripts in a URL path.

Consider http://en.wiktionary.org/wiki/³ which is distinctly different from http://en.wiktionary.org/wiki/3 -- normalize will convert the former into the latter.

> require 'addressable/template'
 => true

> Addressable::URI.parse("http://en.wiktionary.org/wiki/³")
 => #<Addressable::URI:0x500b93c URI:http://en.wiktionary.org/wiki/³>

> Addressable::URI.parse("http://en.wiktionary.org/wiki/³").normalize
 => #<Addressable::URI:0x500f014 URI:http://en.wiktionary.org/wiki/3>

> Addressable::URI.unencode("http://en.wiktionary.org/wiki/%C2%B3")
 => "http://en.wiktionary.org/wiki/³"

> Addressable::URI.parse("http://en.wiktionary.org/wiki/%C2%B3").normalize
 => #<Addressable::URI:0x50290c2 URI:http://en.wiktionary.org/wiki/3>
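For what it's worth, the same substitution can be reproduced in plain Ruby without Addressable. That suggests (an assumption on my part, I have not traced the library code) that Unicode compatibility normalization (NFKC) is what folds the superscript into a plain digit. A minimal sketch using String#unicode_normalize:

```ruby
# Plain Ruby, no Addressable required.
superscript = "\u00B3" # "³", SUPERSCRIPT THREE

nfc  = superscript.unicode_normalize(:nfc)  # canonical normalization
nfkc = superscript.unicode_normalize(:nfkc) # compatibility normalization

puts nfc == superscript # true: NFC keeps the superscript
puts nfkc == "3"        # true: NFKC folds it to a plain digit
```

So a form that only applies canonical normalization would keep the two wiktionary URLs distinct, while a compatibility form would not.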

I also tried normalizing the path directly so that I could pass the leave_encoded parameter, but that did not work either. As the last two examples show, leave_encoded was respected for the ampersand (it remains encoded) but not for the superscript (it still changes to a regular 3).

> require 'addressable/template'
 => true

> Addressable::URI.normalize_component("/wiki/³", leave_encoded=/[³]/)
 => "/wiki/3"

> Addressable::URI.normalize_component("/wiki/%C2%B3", leave_encoded=/[³]/)
 => "/wiki/3"

> Addressable::URI.normalize_component("/wiki/³%26³")
 => "/wiki/3&3"

> Addressable::URI.normalize_component("/wiki/³%26³", leave_encoded=/[&³]/)
 => "/wiki/3%263"

> Addressable::URI.normalize_component("/wiki/%C2%B3%26%C2%B3", leave_encoded=/[&³]/)
 => "/wiki/3%263"

This may be related to issue #100, or at least is likely related to the same section of code.

DreadPirateShawn, May 09 '15 19:05

The bug here is with leave_encoded. See http://intertwingly.net/blog/2004/07/31/URI-Equivalence and referenced discussion for why this behavior is correct in the absence of leave_encoded.

sporkmonger, Nov 04 '16 00:11

Ran into this issue, and it seems to still be around.

Actual: Addressable::URI.unencode_component("%E2%84%A2", String, "%E2%84%A2") => "™"

Expected: Addressable::URI.unencode_component("%E2%84%A2", String, "%E2%84%A2") => "%E2%84%A2"
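To make the expected behavior concrete, here is a rough sketch in plain Ruby. `unencode_except` is a hypothetical helper, not part of Addressable, and it assumes `leave_encoded` lists the decoded characters (for example "™") rather than their percent-encoded form:

```ruby
# Hypothetical helper for illustration only; not Addressable's implementation.
# Decodes percent-escapes, but keeps any character listed in leave_encoded
# in its percent-encoded form.
def unencode_except(component, leave_encoded = "")
  # Handle each run of escapes as a whole so that a multi-byte UTF-8
  # sequence such as %E2%84%A2 is decoded as one character.
  component.gsub(/(?:%\h\h)+/) do |run|
    decoded = [run.delete("%")].pack("H*").force_encoding(Encoding::UTF_8)
    # Leave the run untouched if the bytes are not valid UTF-8.
    next run unless decoded.valid_encoding?

    decoded.each_char.map do |char|
      if leave_encoded.include?(char)
        # Re-encode every byte of the protected character.
        char.bytes.map { |byte| format("%%%02X", byte) }.join
      else
        char
      end
    end.join
  end
end

puts unencode_except("%E2%84%A2")      # decodes to the trademark sign
puts unencode_except("%E2%84%A2", "™") # stays percent-encoded
```

The point of decoding whole runs first is that the comparison against leave_encoded happens per character, not per byte, so multi-byte characters can be protected as well.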

@sporkmonger I know this issue is super old, but do you know if there was any attempt to fix it?

AnthonyClark, Apr 26 '19 21:04

@AnthonyClark I don't think there's been any attempt to address this (links to the blame views: unencode_component, normalize_component)

dentarg, Mar 14 '21 19:03