Clean document and encoding for mailto: protocol results in unexpected output.

Open TasikBeyond opened this issue 1 year ago • 2 comments

Bug Report

The document cleaning function (Cleaner.clean) percent-encodes characters twice. This only happens when the original HTML contains both a %20 and a [ or ].

How to Reproduce

let html = #"<a href="mailto:[email protected]?subject=Job%20Requisition[NID]">Send</a></body></html>"#

let document = try SwiftSoup.parse(html)
let outputSettings = OutputSettings()
outputSettings.prettyPrint(pretty: false)
document.outputSettings(outputSettings)

let headWhitelist: Whitelist = {
    do {
        let customWhitelist = Whitelist.none()
        try customWhitelist
            .addTags("a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "mailto")
        return customWhitelist
    } catch {
        fatalError("Couldn't init head whitelist")
    }
}()

print("Original Document (before clean): ", document)
let cleaned = try Cleaner(headWhitelist: headWhitelist, bodyWhitelist: headWhitelist).clean(document)
print("Original Document (after clean): ", document)
print("Clean Document: ", cleaned)

Expected Behavior

Cleaning let html = #"<a href="mailto:[email protected]?subject=Job%20Requisition[NID]">Send</a></body></html>"#

Should result in

<html>
 <head></head>
 <body>
  <a href="mailto:[email protected]?subject=Job%20Requisition%5BNID%5D">Send</a>
 </body>
</html>

Actual Behavior

<html>
 <head></head>
 <body>
  <a href="mailto:[email protected]?subject=Job%2520Requisition%5BNID%5D">Send</a>
 </body>
</html>

Note: %2520 appears to be %20 getting encoded again.

Environment

SwiftSoup Version: 2.6.1
Xcode Version: 15.3

Additional Notes

I print the original document both before and after the clean(document) call because it appears that the original document is being modified as well as the cleaned one.

print("Original Document (before clean): ", document)
let cleaned = try Cleaner(headWhitelist: headWhitelist, bodyWhitelist: headWhitelist).clean(document)
print("Original Document (after clean): ", document)

TasikBeyond avatar May 03 '24 16:05 TasikBeyond

still an issue? please submit a PR, or at least test coverage

aehlke avatar Feb 28 '25 21:02 aehlke

tl;dr: Adding .preserveRelativeLinks(true) when creating your whitelist works around the issue. Without it, escaping cannot easily be prevented completely, but it can be fixed so that the %20 is no longer double-escaped.


This one is pretty tricky: it's caused by the interaction between Apple's URL parsing and the whitelist settings.

If the whitelist's .preserveRelativeLinks(true) is not set (like in your example), SwiftSoup tries to resolve URLs to turn relative URLs into absolute ones. So simply adding .preserveRelativeLinks(true) when creating your whitelist works around the issue.
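As a minimal sketch of that workaround, the whitelist from the report could be built as follows. The function name makeMailtoWhitelist and the address user@example.com are placeholders for illustration, not taken from the original report.

```swift
import SwiftSoup

// Hypothetical helper: same rules as the whitelist in the report,
// plus preserveRelativeLinks(true) so href values are not resolved
// (and therefore not re-escaped) while cleaning.
func makeMailtoWhitelist() throws -> Whitelist {
    return try Whitelist.none()
        .addTags("a")
        .addAttributes("a", "href")
        .addProtocols("a", "href", "mailto")
        .preserveRelativeLinks(true)
}

// Placeholder input; the address is an assumption for the example.
let html = #"<a href="mailto:user@example.com?subject=Job%20Requisition[NID]">Send</a>"#
let document = try SwiftSoup.parse(html)
let whitelist = try makeMailtoWhitelist()
let cleaned = try Cleaner(headWhitelist: whitelist, bodyWhitelist: whitelist).clean(document)
print(try cleaned.body()?.html() ?? "")
```

With this setting the href should pass through the cleaner without being rewritten, so the %20 is not escaped a second time.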

The escaping is done when it gets passed to URL(string: relURL) in StringUtil.resolve(_:relUrl:). As you've discovered, as soon as there's a [ or ], Apple suddenly also escapes the %20. Maybe there are other characters that also trigger this behaviour. I found two solutions to work around the escaping of the %20 but the [ and ] still get escaped, resulting in mailto:[email protected]?subject=Job%20Requisition%5BNID%5D. Better than before but both of my solutions aren't very good:

  1. Use CFURLCreateWithString(nil, relUrl as CFString, nil) as URL?. This works but on Linux this API is probably not available.
  2. Use URLComponents, split the URL at the first ? and feed the first part to the initialiser, percent-unescape the second part, and feed that to urlComponents.query (which then re-escapes it).

Both solutions result in mailto:[email protected]?subject=Job%20Requisition%5BNID%5D as the output. Unless I find a better solution, I'm going with CFURLCreateWithString on Apple platforms and falling back to URLComponents on the others.
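The URLComponents approach (solution 2) can be sketched roughly like this. It is an illustration of the steps described above, not the code that went into SwiftSoup; normalizeQueryEscaping and the address user@example.com are hypothetical.

```swift
import Foundation

/// Hypothetical helper (not SwiftSoup's actual code): re-encodes only the
/// query part of a URL string so that existing escapes such as %20 are
/// not escaped a second time.
func normalizeQueryEscaping(_ raw: String) -> String? {
    // Split at the first "?" only; any later "?" stays inside the query.
    let parts = raw.split(separator: "?", maxSplits: 1, omittingEmptySubsequences: false)
    guard let head = parts.first,
          var components = URLComponents(string: String(head)) else {
        return nil
    }
    if parts.count == 2 {
        let rawQuery = String(parts[1])
        // Undo existing percent-escapes (fall back to the raw text if they
        // are malformed). Assigning to `query` makes URLComponents apply
        // percent-encoding exactly once.
        components.query = rawQuery.removingPercentEncoding ?? rawQuery
    }
    return components.string
}

let input = "mailto:user@example.com?subject=Job%20Requisition[NID]"

// For comparison: what URL(string:) yields depends on the platform and
// Foundation version, so no output is asserted here.
print(URL(string: input)?.absoluteString ?? "nil")

print(normalizeQueryEscaping(input) ?? "nil")
```

The brackets still come out as %5B and %5D, matching the result described above. Two caveats of this sketch: it does not treat a # fragment separately, and unescaping the query first means an encoded delimiter such as %26 would turn back into a literal &.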


I also found another issue while working on this: the URL modifications were done to the original element's attribute, thus breaking the promise of Cleaner to not modify its source document. I've fixed that.
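A regression check for that second problem could look roughly like the sketch below. cleanerLeavesSourceUntouched is a hypothetical helper, not part of SwiftSoup or its test suite; it only compares the serialized source document before and after cleaning.

```swift
import SwiftSoup

// Hypothetical helper: returns true when Cleaner.clean leaves the
// serialized HTML of the source document unchanged.
func cleanerLeavesSourceUntouched(html: String, whitelist: Whitelist) throws -> Bool {
    let document = try SwiftSoup.parse(html)
    let before = try document.outerHtml()
    // The cleaned copy is discarded; only the source document matters here.
    _ = try Cleaner(headWhitelist: whitelist, bodyWhitelist: whitelist).clean(document)
    let after = try document.outerHtml()
    return before == after
}
```

Calling it with the whitelist and HTML from the original report would be expected to return false on affected versions and true once the fix is in place.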

DarkDust avatar Aug 18 '25 17:08 DarkDust