link-check
Double encoding of special chars in URL doesn't work well in all cases
See for instance tcort/markdown-link-check#155
The user wants to check https://en.wikipedia.org/wiki/%3F:
But the ? sign is decoded by link-check and not re-encoded, because ? is a legitimate URL character; the problem is that it has the specific meaning of starting the query string. So https://en.wikipedia.org/wiki/%3F: becomes https://en.wikipedia.org/wiki/?:, i.e. https://en.wikipedia.org/wiki/ with the query string :.
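This round trip can be reproduced directly in Node (a minimal sketch, assuming the same encodeURI(decodeURIComponent(...)) sequence used in lib/proto/http.js):

// decodeURIComponent turns %3F back into a literal "?"
decodeURIComponent("https://en.wikipedia.org/wiki/%3F:")
// -> "https://en.wikipedia.org/wiki/?:"

// encodeURI leaves "?" alone because it is a reserved URL character,
// so the original query-separator escaping is lost
encodeURI(decodeURIComponent("https://en.wikipedia.org/wiki/%3F:"))
// -> "https://en.wikipedia.org/wiki/?:"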
We need to find a way to deal with encoded characters without all these issues. There are already a lot of tests for specific cases in the test suite. They have to keep passing, but the risk right now is accumulating a pile of special cases. Finding a generic way to deal with this would be nice.
I encountered the same issue: markdown-link-check fails on the URL https://libraries.io/npm/@action-class%2Fcore/tree
Because %2F ends up as a literal / after the decode/re-encode step, the request results in a 404: the server's path parsing now sees an additional path component.
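The same round trip shows why the path breaks here (again a sketch of the encodeURI(decodeURIComponent(...)) sequence from lib/proto/http.js):

// %2F is decoded to "/", and encodeURI does not re-escape "/"
// because it is a reserved path separator, so an extra path component appears
encodeURI(decodeURIComponent("https://libraries.io/npm/@action-class%2Fcore/tree"))
// -> "https://libraries.io/npm/@action-class/core/tree"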
In my opinion the manual reencoding is unnecessary since the new URL() call already normalizes the URL.
Example:
new URL("https://example.com/foo%2Fbar/foo bar/?test=arg%20with+spaces&test2=arg unencoded").toString()
results in
https://example.com/foo%2Fbar/foo%20bar/?test=arg%20with+spaces&test2=arg%20unencoded
So my suggestion would be to completely remove the encodeURI and decodeURIComponent calls from https://github.com/tcort/link-check/blob/master/lib/proto/http.js#L40 and rely on the normalization done by JavaScript's URL class.
diff --git a/lib/proto/http.js b/lib/proto/http.js
index f6530a4..548e7a8 100644
--- a/lib/proto/http.js
+++ b/lib/proto/http.js
@@ -31,13 +31,8 @@ module.exports = {
let user_agent = opts.user_agent || `${pkg.name}/${pkg.version}`;
- // Decoding and encoding is required to prevent encoding already encoded URLs
- // We decode using the decodeURIComponent as it will decode a wider range of
- // characters that were not necessary to be encoded at first, then we re-encode
- // only the required ones using encodeURI.
- // Note that we don't use encodeURIComponents as it adds too much non-necessary encodings
- // see "Not Escaped" list in https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent#description
- const url = encodeURI(decodeURIComponent(new URL(link, opts.baseUrl).toString()));
+ // rebase relative urls and normalize url encoding
+ const url = new URL(link, opts.baseUrl).toString();
const options = {
user_agent: user_agent,
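For what it's worth, a quick check in Node suggests that both URLs reported in this thread survive a plain new URL(...) round trip unchanged (sketch only; it relies on the WHATWG URL serializer not decoding existing percent-escapes):

// existing percent-escapes in the path are kept as-is
new URL("https://en.wikipedia.org/wiki/%3F:").toString()
// -> "https://en.wikipedia.org/wiki/%3F:"

new URL("https://libraries.io/npm/@action-class%2Fcore/tree").toString()
// -> "https://libraries.io/npm/@action-class%2Fcore/tree"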
Alternatively, adding an option parameter to skip the re-encoding would be highly appreciated. That way a per-URL option could be added to the markdown-link-check config to disable this 'feature' for problematic URLs. A purely hypothetical sketch of what that could look like follows below.
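To be clear, nothing like this exists in markdown-link-check today; the key names urlOptions and skipEncoding below are made up and only illustrate the idea of a per-URL switch in the config:

// Hypothetical config sketch only: "urlOptions" and "skipEncoding"
// are invented names, not real markdown-link-check options.
const exampleConfig = {
  urlOptions: [
    {
      pattern: "^https://libraries\\.io/npm/",
      skipEncoding: true // leave matching URLs exactly as written
    }
  ]
};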