Allow urlencoded data URLs
This automatically selects urlencoded data URLs if that results in smaller output than base64 encoding them.
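As a rough illustration of the selection rule (a minimal sketch, not the code in this PR): both output lengths can be computed up front and the shorter form chosen. The names `base64_len`, `percent_encoded_len`, `choose_encoding` and `Encoding` are hypothetical, and the `needs_escape` predicate is left to the caller.

```rust
/// Which payload encoding to use for a data URL (hypothetical type).
#[derive(Debug, PartialEq)]
enum Encoding {
    Base64,
    Percent,
}

/// Length of standard padded base64 output for `n` input bytes:
/// every group of up to 3 bytes becomes 4 characters.
fn base64_len(n: usize) -> usize {
    (n + 2) / 3 * 4
}

/// Length of the percent-encoded form: each escaped byte becomes
/// a three-character `%XX` sequence, every other byte stays as is.
fn percent_encoded_len(data: &[u8], needs_escape: impl Fn(u8) -> bool) -> usize {
    data.iter().map(|&b| if needs_escape(b) { 3 } else { 1 }).sum()
}

/// Pick percent-encoding only when it is strictly shorter.
/// The base64 form also pays for the `;base64` marker in the URL header.
fn choose_encoding(data: &[u8], needs_escape: impl Fn(u8) -> bool) -> Encoding {
    let b64 = base64_len(data.len()) + ";base64".len();
    let pct = percent_encoded_len(data, needs_escape);
    if pct < b64 {
        Encoding::Percent
    } else {
        Encoding::Base64
    }
}
```

Mostly-textual payloads (SVG, CSS, HTML) tend to favour percent-encoding under this rule, while binary payloads, where nearly every byte needs escaping, fall back to base64.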
This is a kinda stupid idea I had as I saw the size of some really large dumps (for example from comic pages on tapas.io)...
It seems to work... In a sample page from tapas.io, it reduces the final size from 213,233,452 bytes to 170,968,176 bytes, which is about 20% smaller.
The set of percent-encoded characters comes from https://datatracker.ietf.org/doc/html/rfc3986#section-2.2, since that is what https://developer.mozilla.org/en-US/docs/Web/URI/Reference/Schemes/data references. I found no concrete list of characters that should be escaped in the data URL RFC (https://www.rfc-editor.org/rfc/rfc2397). Additionally, we escape % so that nested data: URLs (PNGs in CSS, anyone?) work, and " so that we don't accidentally close the quotes surrounding the data URL.
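To make the escape set concrete, here is a minimal sketch of such an encoder (not the code in this PR). The `RESERVED` list is the gen-delims and sub-delims from RFC 3986 section 2.2. The names `needs_escape` and `percent_encode` are hypothetical, and escaping control characters, spaces and non-ASCII bytes is an assumption of this sketch.

```rust
/// Reserved characters from RFC 3986 section 2.2 (gen-delims and sub-delims).
const RESERVED: &[u8] = b":/?#[]@!$&'()*+,;=";

/// Decide whether a byte must be written as `%XX`.
/// Assumption: besides the reserved set, `%` and `"`, this sketch also
/// escapes control characters, spaces and non-ASCII bytes.
fn needs_escape(b: u8) -> bool {
    RESERVED.contains(&b) || b == b'%' || b == b'"' || !b.is_ascii_graphic()
}

/// Percent-encode raw bytes for use as the payload of a data URL.
fn percent_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity(data.len());
    for &b in data {
        if needs_escape(b) {
            // Two uppercase hex digits, zero-padded.
            out.push_str(&format!("%{:02X}", b));
        } else {
            // Safe: only printable ASCII bytes reach this branch.
            out.push(b as char);
        }
    }
    out
}
```

Because `%` is itself escaped, an already percent-encoded inner data: URL survives being embedded in an outer one: the outer decoder restores the inner `%XX` sequences verbatim.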
I'm not sure if the encoding is correct for exotic (non-UTF-8) charsets... Please advise if I should add more tests covering such scenarios.
(PS: Feel free to close this PR if this all sounds too stupid/mad)
Hello Tobias,
Thank you very much for this PR!
Not a stupid idea at all: base64 unnecessarily bloats plaintext, producing a result something like 30% longer on average than URL encoding.
Originally I created https://github.com/Y2Z/dataurl to move the code that parses and creates data URLs out of Monolith, and make dataurl available as both a crate and CLI tool. It's somewhere in my backlog to switch to using it along with only using base64 for binary data.
I'll review your PR briefly and get back to you.