linkedom
Double encoding of anchor tag href
Hey,
I've been using your amazing library to extract links from DOM content in a service worker, and I noticed that anchor tags pointing to a file whose URL already contains encoded space characters (for example www.test.com/path%20to%20some%20file.pdf) end up with a broken href, because the % character is encoded again (www.test.com/path%2520to%2520some%2520file.pdf).
I have a workaround: calling decodeURIComponent on the href attribute before processing it. But I guess that is not the intended behavior, and the workaround shouldn't be necessary.
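To make the symptom concrete, here is a minimal sketch that reproduces the double encoding with plain JavaScript. It assumes the getter behaves like a bare encodeURI() call on the raw attribute value. That is an assumption for illustration, not a statement about what the library does internally.

```javascript
// Assumption: the href getter acts like encodeURI() on the raw attribute.
const raw = 'www.test.com/path%20to%20some%20file.pdf';

// encodeURI() escapes "%" itself, so every "%20" turns into "%2520".
const doubleEncoded = encodeURI(raw);
console.log(doubleEncoded); // www.test.com/path%2520to%2520some%2520file.pdf

// The workaround from above: one decode pass strips the extra layer.
const repaired = decodeURIComponent(doubleEncoded);
console.log(repaired); // www.test.com/path%20to%20some%20file.pdf
```

Note that the decode step only helps when the value really was encoded twice; applied to a correctly encoded URL it would turn %20 back into a literal space.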
I am sure that it's related to https://github.com/WebReflection/linkedom/issues/49 and the fix for it here https://github.com/WebReflection/linkedom/blob/5b31c583c79423d97fa1982d9a30a8f0a0982485/cjs/html/anchor-element.js#L18
I would guess and hope it's not a complicated fix.
Thanks a lot for your work!
This is weird to fix because, AFAIK, the href getter does sanitize its content as an attribute ... can you please write a test that works in browsers but fails here?
Imagine an anchor tag pointing to a file. This file has space characters in its filename which means the " " character will already be encoded as "%20". For example:
<a href="https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf"></a>
When I add the following test to test/html/anchor-element.js, it fails:
a.setAttribute('href', 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');
assert(a.href, 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');
I would expect to get the same link because otherwise the link is broken.
Expected: https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf
Got instead: https://www.westonpriory.org/esales/lyrics/Something%2520Which%2520Is%2520Known.pdf
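For comparison, the standard URL class (the WHATWG URL parser, available as a global in browsers, service workers and Node.js) can serve as a rough stand-in for what a browser reports for an absolute href. This is only a sketch of the expected behavior, not a test from the repository; the second input with literal spaces is added for illustration.

```javascript
// The URL from the failing test, already percent-encoded.
const alreadyEncoded =
  'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf';
// Illustrative variant with raw spaces in the path.
const withSpaces =
  'https://www.westonpriory.org/esales/lyrics/Something Which Is Known.pdf';

// Existing "%20" escapes are kept as they are, not turned into "%2520".
const fromEncoded = new URL(alreadyEncoded).href;
// Raw spaces in the path are percent-encoded exactly once.
const fromSpaces = new URL(withSpaces).href;

console.log(fromEncoded === alreadyEncoded); // true
console.log(fromSpaces === alreadyEncoded); // true
```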
I opened a PR to fix this: https://github.com/WebReflection/linkedom/pull/204.
Agreed, and I hate double encoding/decoding shenanigans myself ... the thing I am not understanding is where that string gets encoded in the first place and, in case I really want to manually write a URL with %20 in it, whether it's right to treat that as a space regardless ... if it's a 3rd party library doing that, or me doing that while processing the HTML/XML/SVG, though, I believe the gotcha should be solved there, or I should pay more attention to every other sensitive accessor that points at already-encoded attributes ... thoughts?
I am thinking mainly of file servers that serve static files and <a> tags that point to those files. When those file names contain spaces, the paths to those files must encode special characters like spaces (unless the server has some internal logic to make them available under different paths). In other cases, like the one you mentioned with 3rd party libraries, I would agree with you.
If you have concerns, I can also just use the workaround that I mentioned in the issue and keep that logic outside of the library, but then, in my case, the library would not behave the same as regular DOM queries.
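One possible way to get encoding that is safe to apply to already-encoded values is sketched below. The helper name encodePreservingEscapes is hypothetical and not part of linkedom, and this is not necessarily how the linked PR approaches it. The idea is to let encodeURI() do its work and then undo the extra %25 only where it sits in front of two hex digits, i.e. where the input already had a valid escape.

```javascript
// Hypothetical helper for illustration only; not a linkedom API.
function encodePreservingEscapes(value) {
  // encodeURI() escapes every "%", including those of existing escapes,
  // so "%20" becomes "%2520". Turn "%25XX" back into "%XX" when XX is
  // two hex digits, which restores the escapes that were already there.
  return encodeURI(value).replace(/%25([0-9A-Fa-f]{2})/g, '%$1');
}

const mixed = 'https://example.com/a%20b c.pdf';
const once = encodePreservingEscapes(mixed);
const twice = encodePreservingEscapes(once);

console.log(once); // https://example.com/a%20b%20c.pdf
console.log(twice === once); // true
console.log(encodePreservingEscapes('100%')); // 100%25
```

The trade-off is the one raised above: a file name that literally contains the characters %20 cannot be told apart from an encoded space, so with this approach it has to be written as %2520 in the attribute.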