
Too much quoting of URL fragments

Open · pmhahn opened this issue 10 years ago · 1 comment

I have several documents where the fragment identifier of the URL contains colons (':'), which linkchecker reports as "not normalized". According to my reading of RFC 3986, ':' is allowed in a fragment, but the specification is quite unclear about whether the character must be escaped, as ':' is in the "reserved" set.

While researching this issue I found https://jazzy.id.au/static/fragment-encoding-test.html, which indicates that escaping as little as possible seems to be a good thing for browser compatibility.

http://pythonhosted.org/uritools/ also does not escape the ':'.

Maybe the following patch is in order:

From 2d6f2eeecb6ed6ca31e2e3cf0760b538b2f39c48 Mon Sep 17 00:00:00 2001
Message-Id: <2d6f2eeecb6ed6ca31e2e3cf0760b538b2f39c48.1436776003.git.hahn@univention.de>
From: Philipp Hahn <hahn@univention.de>
Date: Mon, 13 Jul 2015 10:24:44 +0200
Subject: [PATCH] Fix fragment identifier quoting
Organization: Univention GmbH, Bremen, Germany

According to <https://tools.ietf.org/html/rfc3986>:
 fragment    = *( pchar / "/" / "?" )
 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
 unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
 pct-encoded = "%" HEXDIG HEXDIG
 sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

---
 linkcheck/url.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linkcheck/url.py b/linkcheck/url.py
index 6263fb7..a3a1dd3 100644
--- a/linkcheck/url.py
+++ b/linkcheck/url.py
@@ -329,7 +329,7 @@ def url_norm (url, encoding=None):
     urlparts[0] = url_quote_part(urlparts[0], encoding=encoding) # scheme
     urlparts[1] = url_quote_part(urlparts[1], safechars='@:', encoding=encoding) # host
     urlparts[2] = url_quote_part(urlparts[2], safechars=_nopathquote_chars, encoding=encoding) # path
-    urlparts[4] = url_quote_part(urlparts[4], encoding=encoding) # anchor
+    urlparts[4] = url_quote_part(urlparts[4], safechars="!$&'()*+,-./;=?@_~", encoding=encoding) # anchor
     res = urlunsplit(urlparts)
     if url.endswith('#') and not urlparts[4]:
         # re-append trailing empty fragment
-- 
1.9.1
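
To illustrate the effect of the safe-character set, here is a minimal sketch in Python 3. It assumes that `urllib.parse.quote` behaves like linkchecker's `url_quote_part` for plain ASCII input, which I have not verified against the current code; `FRAGMENT_SAFE`, `PATCH_SAFECHARS` and `quote_fragment` are hypothetical names used only for this example. It builds the safe set from the RFC 3986 grammar quoted in the commit message and compares it with the `safechars` string from the diff.

```python
from urllib.parse import quote

# Punctuation that RFC 3986 allows unescaped in a fragment, following
# fragment = *( pchar / "/" / "?" ). Letters and digits are always left
# alone by quote(), so only punctuation needs to be listed.
UNRESERVED_PUNCT = "-._~"
SUB_DELIMS = "!$&'()*+,;="
FRAGMENT_SAFE = UNRESERVED_PUNCT + SUB_DELIMS + ":@" + "/?"

# The safechars string exactly as it appears in the diff above.
PATCH_SAFECHARS = "!$&'()*+,-./;=?@_~"


def quote_fragment(fragment, safe=FRAGMENT_SAFE):
    """Percent-encode a fragment, leaving the characters in `safe` untouched."""
    return quote(fragment, safe=safe)


print(quote_fragment("sec:intro"))                   # sec:intro
print(quote_fragment("sec:intro", PATCH_SAFECHARS))  # sec%3Aintro
print(quote_fragment("a b"))                         # a%20b
```

Under these assumptions, a colon is only left unescaped when ':' itself is part of the safe set. The `safechars` string in the diff does not list it, so it may need ':' added to cover the case described in this report.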

pmhahn, Jul 13 '15 10:07

Thank you for the issue report. Sadly, this project is dead, and a new team has taken over at https://github.com/linkcheck/linkchecker; for more details, please see #708. Please also close this issue and report it afresh on the new repo, https://github.com/linkcheck/linkchecker/issues, if the problem still persists.

dpalic, Oct 29 '17 09:10