linkedom Double encoding of anchor tag href

Hey,

I've been using your amazing library to extract links from DOM content in a service worker and noticed that anchor tags whose URL already contains encoded space characters (for example www.test.com/path%20to%20some%20file.pdf) end up with a broken href, because the % character is encoded again (www.test.com/path%2520to%2520some%2520file.pdf).

I have a workaround: calling decodeURIComponent on the href attribute before processing it. But I guess that is not the intended behavior and it shouldn't be necessary. I am sure that it's related to https://github.com/WebReflection/linkedom/issues/49 and the fix for it here: https://github.com/WebReflection/linkedom/blob/5b31c583c79423d97fa1982d9a30a8f0a0982485/cjs/html/anchor-element.js#L18
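To make the mechanism concrete, here is a minimal sketch in plain JavaScript. It assumes the href getter behaves like `encodeURI` applied to the raw attribute value (my reading of the linked line, not something confirmed in this thread). The variable names are only for illustration.

```javascript
// Assumption: the getter is roughly equivalent to encodeURI(rawAttributeValue).
const raw = 'https://www.test.com/path%20to%20some%20file.pdf';

// encodeURI does not recognise existing escapes, so each "%" becomes "%25".
const doubleEncoded = encodeURI(raw);
console.log(doubleEncoded);

// The workaround from this report: strip one layer of encoding again.
const repaired = decodeURIComponent(doubleEncoded);
console.log(repaired === raw);
```

Note that the workaround is only safe when the value really went through one extra encoding pass. If the attribute contained a literal space, decoding the getter's output would return the unencoded space.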

I would guess and hope it's not a complicated fix.

Thanks a lot for your work!

Jan 07 '23 17:01 marcelreppi

this is weird to fix because AFAIK the href getter does sanitize its content as attribute ... can you please write a test that works on browsers but fails in here?

Mar 21 '23 19:03 WebReflection

Imagine an anchor tag pointing to a file. This file has space characters in its filename, which means each " " character is already encoded as "%20" in the URL. For example:

<a href="https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf"></a>

When I add the following test to test/html/anchor-element.js it fails.

a.setAttribute('href', 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');
assert(a.href, 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf');

I would expect to get the same link because otherwise the link is broken.

 Expected: https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf
 Got instead: https://www.westonpriory.org/esales/lyrics/Something%2520Which%2520Is%2520Known.pdf
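For comparison, the behavior I would expect can be approximated outside the DOM with the WHATWG `URL` class, which is available in browsers and as a global in Node.js. This is only a sketch of the expectation, not a claim about how any browser implements the anchor `href` getter internally. The example.com URL is a made-up input.

```javascript
// An already-encoded URL is left untouched by URL serialization.
const encoded = 'https://www.westonpriory.org/esales/lyrics/Something%20Which%20Is%20Known.pdf';
const roundTripped = new URL(encoded).href;
console.log(roundTripped === encoded);

// A literal space in the path is encoded exactly once.
const withSpace = new URL('https://example.com/some file.pdf').href;
console.log(withSpace);
```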

I opened a PR to fix this: https://github.com/WebReflection/linkedom/pull/204.
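One possible shape for such a fix, shown only as an illustration and not necessarily what the PR does: leave every existing well-formed `%XX` escape alone and run `encodeURI` only on the text between them. The helper name `encodeHrefOnce` is hypothetical.

```javascript
// Hypothetical helper: encode an href without touching existing %XX escapes.
function encodeHrefOnce(value) {
  return value
    // The capturing group keeps the matched escapes in the result array,
    // so odd indexes are existing escapes and even indexes are plain text.
    .split(/(%[0-9A-Fa-f]{2})/)
    .map((part, index) => (index % 2 === 1 ? part : encodeURI(part)))
    .join('');
}

const once = encodeHrefOnce('https://example.com/some file.pdf');
const twice = encodeHrefOnce(once);
console.log(once, twice === once);
```

Because escapes are preserved and everything else is encoded, applying the helper a second time returns the same string. A stray "%" that is not part of a valid escape is still encoded as "%25".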

May 12 '23 20:05 marcelreppi

Agreed, and I hate double encoding/decoding shenanigans myself ... what I don't understand is where that string gets encoded in the first place and, if I really do want to write a URL with a literal %20 in it by hand, whether it's right to always treat that as a space ... if it's a 3rd party library doing the encoding, or me doing it while processing the HTML/XML/SVG, I believe the gotcha should be solved there, or else I should pay more attention to every other sensitive accessor that points at already-encoded attributes ... thoughts?

May 13 '23 08:05 WebReflection

I am thinking mainly of file servers that serve static files and <a> tags that point to those files. When those file names contain spaces, the paths to those files must encode special characters like spaces (unless the server has some internal logic to make them available under different paths). In other cases like you mentioned with 3rd party libraries I would agree with you.

If you have concerns I can also just use the workaround that I mentioned in the issue to move that logic outside of the library but then, in my case, the library would not have the same behavior as regular DOM queries.

May 28 '23 06:05 marcelreppi
