link-check icon indicating copy to clipboard operation
link-check copied to clipboard

Double encoding of special chars in URL doesn't work well in all cases

Open NicolasMassart opened this issue 4 years ago • 1 comments
trafficstars

See for instance tcort/markdown-link-check#155 User is looking to check https://en.wikipedia.org/wiki/%3F: But the ? sign is then decoded by link-check and not reencoded as it's a legit char for a url but that have a specific meaning of being the param part start. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?: meaning https://en.wikipedia.org/wiki/ with parameter :. We need to find a way to deal with encoded chars without all these issues. There's already a lot of tests for some specific cases in the test suite. They have to continue to work, but for now the risk is to have a pile of specific cases. Finding a generic way to deal with this would be nice.

NicolasMassart avatar Mar 11 '21 11:03 NicolasMassart

I encountered the same issue using markdown-link-check failing on the URL https://libraries.io/npm/@action-class%2Fcore/tree

With the reencoding of %2F to / the request results in a 404 since the path argument parsing now receives an additional path component.

In my opinion the manual reencoding is unnecessary since the new URL() call already normalizes the URL.

Example:

new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()

results in

https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded

So my suggestion would be to completely remove the encodeURI and decodeURIComponen calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization of JavaScripts URL class.

diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
 
         let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
 
-        // Decoding and encoding is required to prevent encoding already encoded URLs
-        // We decode using the decodeURIComponent as it will decode a wider range of 
-        // characters that were not necessary to be encoded at first, then we re-encode
-        // only the required ones using encodeURI.
-        // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
-        // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
-        const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+        // rebase relative urls and normalize url encoding
+        const url = new URL(link, opts.baseUrl).toString();

         const options = {
             user_agent: user_agent,

Alternatively adding an option parameter to skip the reencoding would be highly appreciated. That way an per URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs.

jan-guenter avatar Apr 18 '23 10:04 jan-guenter