link-check Double encoding of special chars in URL doesn't work well in all cases

Double encoding of special chars in URL doesn't work well in all cases

Open NicolasMassart opened this issue 4 years ago • 1 comments

trafficstars

See for instance tcort/markdown-link-check#155 User is looking to check https://en.wikipedia.org/wiki/%3F: But the ? sign is then decoded by link-check and not reencoded as it's a legit char for a url but that have a specific meaning of being the param part start. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?: meaning https://en.wikipedia.org/wiki/ with parameter :. We need to find a way to deal with encoded chars without all these issues. There's already a lot of tests for some specific cases in the test suite. They have to continue to work, but for now the risk is to have a pile of specific cases. Finding a generic way to deal with this would be nice.

Mar 11 '21 11:03 NicolasMassart

I encountered the same issue using markdown-link-check failing on the URL https://libraries.io/npm/@action-class%2Fcore/tree

With the reencoding of %2F to / the request results in a 404 since the path argument parsing now receives an additional path component.

In my opinion the manual reencoding is unnecessary since the new URL() call already normalizes the URL.

Example:

new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()

results in

https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded

So my suggestion would be to completely remove the encodeURI and decodeURIComponen calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization of JavaScripts URL class.

diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
 
         let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
 
-        // Decoding and encoding is required to prevent encoding already encoded URLs
-        // We decode using the decodeURIComponent as it will decode a wider range of 
-        // characters that were not necessary to be encoded at first, then we re-encode
-        // only the required ones using encodeURI.
-        // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
-        // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
-        const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+        // rebase relative urls and normalize url encoding
+        const url = new URL(link, opts.baseUrl).toString();

         const options = {
             user_agent: user_agent,

Alternatively adding an option parameter to skip the reencoding would be highly appreciated. That way an per URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs.

Apr 18 '23 10:04 jan-guenter

link-check link-check copied to clipboard

Double encoding of special chars in URL doesn't work well in all cases

link-check
link-check copied to clipboard