Make `safe` option for SURT actually safe

Open Mr0grog opened this issue 8 months ago • 0 comments

In #1215, I introduced a `safe: true` option for `Surt.canonicalize` with the intention of using it to normalize URLs. However, a few unsafe behaviors turned out to have no option for turning them off (and some good, safe behaviors have no option for turning them on).

We should get that option working correctly. The main thing here is turning off deep/repeated unescaping. This might need some care: I have a feeling there are some places where we do need to unescape once and re-escape (e.g. hostname cleanup), but other places where we should not unescape at all (e.g. paths…?). We may also need a different set of characters to escape for the safe case (see also this note about needing to investigate the Java vs. Python escaping: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/9b389461f832bb26c8fcc8020d51f578c6b10374/app/lib/surt/canonicalize.rb#L78-L83).
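To illustrate why deep unescaping is unsafe, here is a minimal sketch comparing a single unescape pass with a repeated one. The helper names `unescape_once` and `deep_unescape` are hypothetical and are not the actual `Surt.canonicalize` internals; they only show the difference in behavior.

```ruby
# Hypothetical helpers for illustration only; not the real Surt code.

# Decode each %XX sequence exactly once. Text produced by a replacement
# is not rescanned, so "%252F" becomes "%2F", not "/".
def unescape_once(text)
  text.b.gsub(/%([0-9a-fA-F]{2})/) { Regexp.last_match(1).hex.chr }
end

# Keep decoding until the string stops changing ("deep" unescaping).
def deep_unescape(text)
  loop do
    decoded = unescape_once(text)
    return decoded if decoded == text

    text = decoded
  end
end

path = '/a%252Fb'
puts unescape_once(path) # prints "/a%2Fb"
puts deep_unescape(path) # prints "/a/b"
```

The deep version turns an escaped, literal `%2F` into a real path separator, so the result may identify a different resource than the original URL did. That is the kind of change a "safe" mode presumably should not make.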

While doing this, it might also be good to add an option for upper- vs. lower-casing escape sequences. SURT always lower-cases, but RFC 3986 recommends upper-case as the normalized form (section 2.1), and normalization is the intended use case for “safe” canonicalization.
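A case option along these lines could be sketched as follows. The function name `normalize_escape_case` and its `upper:` keyword are assumptions made for illustration, not an existing part of the `Surt` module.

```ruby
# Hypothetical helper; the name and the `upper:` keyword are illustrative only.
# Rewrites the hex digits of every %XX sequence to one case and leaves all
# other characters (including letters in the path itself) untouched.
def normalize_escape_case(text, upper: true)
  text.gsub(/%[0-9a-fA-F]{2}/) { |seq| upper ? seq.upcase : seq.downcase }
end

puts normalize_escape_case('/Path%2fb%c3%A9')               # prints "/Path%2Fb%C3%A9"
puts normalize_escape_case('/Path%2fb%c3%A9', upper: false) # prints "/Path%2fb%c3%a9"
```

Defaulting to upper-case here follows the RFC 3986 recommendation; existing SURT behavior would correspond to `upper: false`.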

Mr0grog, Mar 14 '25 22:03