w3lib canonicalize_url breaks certain url(s)

Open markbaas opened this issue 9 years ago • 5 comments

The url /cmp/Supermercados-Dia%25 is incorrectly unquoted into /cmp/Supermercados-Dia%

Problem happens in _unquotepath:

    def _unquotepath(path):
        for reserved in ('2f', '2F', '3f', '3F'):
            path = path.replace('%' + reserved, '%25' + reserved.upper())
        return urllib.unquote(path)
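To make the report easier to reproduce, here is a minimal sketch of the same logic ported to Python 3. The name `unquote_path_sketch` is made up for illustration, and `urllib.parse.unquote` stands in for the Python 2 `urllib.unquote` call quoted above; it is not the actual w3lib source.

```python
from urllib.parse import unquote


def unquote_path_sketch(path):
    # Hypothetical port of the quoted _unquotepath logic, for illustration only.
    # Encoded "/" and "?" are protected by double-encoding them first...
    for reserved in ('2f', '2F', '3f', '3F'):
        path = path.replace('%' + reserved, '%25' + reserved.upper())
    # ...but every other escape, including %25 itself, is decoded here.
    return unquote(path)


print(unquote_path_sketch('/cmp/Supermercados-Dia%25'))  # /cmp/Supermercados-Dia%
print(unquote_path_sketch('/a%2fb'))                     # /a%2Fb
```

Only the `%2F` and `%3F` escapes survive the round trip; `%25` is not in the protected list, so the literal percent sign loses its encoding.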

Mar 06 '15 15:03 markbaas

Happens with urls containing "%26" (&) as well.
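The comment does not say where the `%26` appears. Assuming it sits in a query string, a rough illustration of why decoding it too early matters could look like this (the sample query and variable names are invented for the example):

```python
from urllib.parse import parse_qsl, unquote

# Hypothetical query string: the value "Tom&Jerry" has its "&" escaped as %26.
encoded = 'name=Tom%26Jerry&page=2'

# Parsing the still-encoded string keeps the value intact.
pairs_ok = parse_qsl(encoded)

# Unquoting first turns %26 into a real separator, so the value is split.
pairs_broken = parse_qsl(unquote(encoded))

print(pairs_ok)
print(pairs_broken)
```

The two results differ, so an unquoted `%26` can change what the server receives rather than just how the URL is spelled.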

Apr 02 '15 06:04 alisufian

From my unscientific tests, with this page,

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<base href="  http://www.example.com/">
<title>No title</title>
</head>

<body>

<a href="/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F">"/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F", relative to base http://www.example.com/</a><br />
<a href="/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F">"/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F", relative to base http://www.example.com/</a><br />
<a href="/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F">"/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F", relative to base http://www.example.com/</a><br />
<a href="/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F">"/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F", relative to base http://www.example.com/</a><br />
<a href="/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F">"/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F", relative to base http://www.example.com/</a><br />
<a href="/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F">"/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F", relative to base http://www.example.com/</a><br />
<a href="/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F">"/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F", relative to base http://www.example.com/</a><br />

</body>

</html>

these are the URLs that my Chrome browser (Version 53.0.2785.113 (64-bit) on Ubuntu) fetches, as seen in the network tab:

http://www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
http://www.example.com/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,-,.,%2F
http://www.example.com/0,1,2,3,4,5,6,7,8,9,%3A,%3B,%3C,%3D,%3E,%3F
http://www.example.com/%40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
http://www.example.com/P,Q,R,S,T,U,V,W,X,Y,Z,%5B,%5C,%5D,%5E,_
http://www.example.com/%60,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
http://www.example.com/p,q,r,s,t,u,v,w,x,y,z,%7B,%7C,%7D,~,%7F

Sep 14 '16 16:09 redapple

Summary for Chrome vs. canonicalize_url:

>>> from w3lib.url import canonicalize_url
>>> 
>>> chrome_normalized = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,-,.,%2F
... 0,1,2,3,4,5,6,7,8,9,%3A,%3B,%3C,%3D,%3E,%3F
... %40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
... P,Q,R,S,T,U,V,W,X,Y,Z,%5B,%5C,%5D,%5E,_
... %60,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
... p,q,r,s,t,u,v,w,x,y,z,%7B,%7C,%7D,~,%7F'''
>>> 
>>> raw_in_html = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F
... %30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
... %40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
... %50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
... %60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
... %70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F'''
>>> 
>>> raw_lines = raw_in_html.splitlines()
>>> norm_lines = chrome_normalized.splitlines()
>>> 
>>> for i, line in enumerate(raw_lines):
...     raw_chars = line.split(',')
...     norm_chars = norm_lines[i].split(',')
...     for pos, c in enumerate(raw_chars):
...         canonicalized = canonicalize_url(c)
...         if c == norm_chars[pos]:
...             if c != canonicalized:
...                 print('{0} was preserved by Chrome, but canonicalize_url("{0}") unquoted it to {1}'.format(c, canonicalized))
...         
... 
%21 was preserved by Chrome, but canonicalize_url("%21") unquoted it to !
%23 was preserved by Chrome, but canonicalize_url("%23") unquoted it to #
%24 was preserved by Chrome, but canonicalize_url("%24") unquoted it to $
%25 was preserved by Chrome, but canonicalize_url("%25") unquoted it to %
%26 was preserved by Chrome, but canonicalize_url("%26") unquoted it to &
%27 was preserved by Chrome, but canonicalize_url("%27") unquoted it to '
%28 was preserved by Chrome, but canonicalize_url("%28") unquoted it to (
%29 was preserved by Chrome, but canonicalize_url("%29") unquoted it to )
%2A was preserved by Chrome, but canonicalize_url("%2A") unquoted it to *
%2B was preserved by Chrome, but canonicalize_url("%2B") unquoted it to +
%2C was preserved by Chrome, but canonicalize_url("%2C") unquoted it to ,
%3A was preserved by Chrome, but canonicalize_url("%3A") unquoted it to :
%3B was preserved by Chrome, but canonicalize_url("%3B") unquoted it to ;
%3D was preserved by Chrome, but canonicalize_url("%3D") unquoted it to =
%40 was preserved by Chrome, but canonicalize_url("%40") unquoted it to @
%7C was preserved by Chrome, but canonicalize_url("%7C") unquoted it to |
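In the Chrome table above, the only escapes that get decoded are letters, digits and `-`, `.`, `_`, `~`, which matches the RFC 3986 "unreserved" set. A minimal sketch of an unquote limited to that set follows; `unquote_unreserved` is a hypothetical helper written for this comparison, not an existing w3lib function and not necessarily how the issue should be fixed.

```python
import re
import string

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + '-._~')
_PERCENT_ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')


def unquote_unreserved(path):
    """Decode %XX only when it encodes an unreserved character (sketch)."""
    def _decode(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Keep every other escape, normalizing the hex digits to upper case.
        return '%' + match.group(1).upper()
    return _PERCENT_ESCAPE.sub(_decode, path)


print(unquote_unreserved('/cmp/Supermercados-Dia%25'))  # /cmp/Supermercados-Dia%25
print(unquote_unreserved('%2A,%2B,%2C,%2D,%2E,%2F'))    # %2A,%2B,%2C,-,.,%2F
```

With this rule `%25`, `%26` and the other escapes listed above stay encoded, while `%2D`, `%2E`, `%5F`, `%7E` and alphanumerics are decoded as in the Chrome network tab.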

Sep 14 '16 16:09 redapple

For Firefox (48.0 Mozilla Firefox for Ubuntu) it's a bit different:

"on the wire" as copied from the network panel:

http://www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
http://www.example.com/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,.,%2F
http://www.example.com/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
http://www.example.com/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
http://www.example.com/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
http://www.example.com/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
http://www.example.com/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F

as displayed in the address bar:

www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
www.example.com/ ,!,",%23,%24,%25,%26,',(,),*,%2B,%2C,-,.,%2F
www.example.com/0,1,2,3,4,5,6,7,8,9,%3A,%3B,<,%3D,>,%3F
www.example.com/%40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
www.example.com/P,Q,R,S,T,U,V,W,X,Y,Z,[,\,],^,_
www.example.com/`,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
www.example.com/p,q,r,s,t,u,v,w,x,y,z,{,|,},~,%7F

Summary using the URL bar data as output:

>>> from w3lib.url import canonicalize_url
>>> 
>>> raw_in_html = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F
... %30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
... %40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
... %50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
... %60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
... %70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F'''
>>> 
>>> firefox_normalized = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
...  ,!,",%23,%24,%25,%26,',(,),*,%2B,%2C,-,.,%2F
... 0,1,2,3,4,5,6,7,8,9,%3A,%3B,<,%3D,>,%3F
... %40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
... P,Q,R,S,T,U,V,W,X,Y,Z,[,\,],^,_
... `,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
... p,q,r,s,t,u,v,w,x,y,z,{,|,},~,%7F'''
>>> 
>>> raw_lines = raw_in_html.splitlines()
>>> norm_lines = firefox_normalized.splitlines()
>>> 
>>> for i, line in enumerate(raw_lines):
...     raw_chars = line.split(',')
...     norm_chars = norm_lines[i].split(',')
...     for pos, c in enumerate(raw_chars):
...         canonicalized = canonicalize_url(c)
...         if c == norm_chars[pos]:
...             if c != canonicalized:
...                 print('{0} was preserved by Firefox, but canonicalize_url("{0}") unquoted it to {1}'.format(c, canonicalized))
...         
... 
%23 was preserved by Firefox, but canonicalize_url("%23") unquoted it to #
%24 was preserved by Firefox, but canonicalize_url("%24") unquoted it to $
%25 was preserved by Firefox, but canonicalize_url("%25") unquoted it to %
%26 was preserved by Firefox, but canonicalize_url("%26") unquoted it to &
%2B was preserved by Firefox, but canonicalize_url("%2B") unquoted it to +
%2C was preserved by Firefox, but canonicalize_url("%2C") unquoted it to ,
%3A was preserved by Firefox, but canonicalize_url("%3A") unquoted it to :
%3B was preserved by Firefox, but canonicalize_url("%3B") unquoted it to ;
%3D was preserved by Firefox, but canonicalize_url("%3D") unquoted it to =
%40 was preserved by Firefox, but canonicalize_url("%40") unquoted it to @
>>>

Sep 14 '16 16:09 redapple

This needs to be moved to w3lib, the new home of canonicalize_url

Sep 16 '16 11:09 redapple
