freeze-dry
Only grab necessary subresources
Currently, we inline all resolutions listed in an `<img>`'s `srcset`, all `<audio>` and `<video>` sources, all stylesheets, etcetera. This makes snapshots huge. The upside is that the snapshot will be as rich as the original, and more likely to work and look as intended in various browsers and screen resolutions. Depending on the application, one or the other factor may be more important, so it would be nice to make configurable how much we grab. Some preliminary thoughts on this:
- One reasonable desire is to grab only things that are currently in use (if this can be tested for). This could help a lot with speeding freeze-dry up, as those things may be available from cache.
- For images with multiple resolutions, we could read `element.currentSrc`, and only grab that one. And/or perhaps get the one with highest resolution.
- For audio and video, the sources are usually different file formats; `currentSrc` seems a reasonable choice again, or some prewired preference to pick a widely supported and/or well compressed format (again a possible trade-off).
- For stylesheets, we may filter by media queries, both in a `media` attribute on a `<link>` (to omit the whole stylesheet), or `@media` at-rules inside stylesheets (to omit the subresources it affects). The next question is then what media queries to filter for: type (screen/print), window size; possibly again only take what is currently active.
- For fonts, we could take only the ones currently used/loaded (how? the `status` attributes of fonts in `document.fonts`?). And we could hard-code a preference for some well compressed and/or widely supported file format.
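To make the image case concrete, here is a minimal sketch of the selection logic only. It assumes the `srcset` has already been parsed and its URLs resolved to absolute form (since `currentSrc` is absolute). The `SrcsetCandidate` type and `pickCandidates` function are hypothetical names for illustration, not part of freeze-dry.

```typescript
// Hypothetical type: one already-parsed srcset entry with an absolute URL.
interface SrcsetCandidate {
  url: string;
  // Pixel density (e.g. 2 for "2x"); width descriptors are ignored in this sketch.
  density: number;
}

// Returns the candidates worth grabbing: the one the browser is currently
// using if we can identify it, otherwise the one with the highest density.
function pickCandidates(
  candidates: SrcsetCandidate[],
  currentSrc: string | undefined,
): SrcsetCandidate[] {
  if (candidates.length === 0) return [];
  // Prefer what the browser actually chose, if it is among the candidates.
  const current = candidates.find((c) => c.url === currentSrc);
  if (current) return [current];
  // Otherwise fall back to the highest pixel density.
  const best = candidates.reduce((a, b) => (b.density > a.density ? b : a));
  return [best];
}
```

Whether to prefer `currentSrc` or the highest resolution is exactly the trade-off mentioned above, so in practice this would probably be an option rather than fixed behaviour.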
Another optimization could be done for `url()`s in CSS:
- include a new `<style>` block at the top of `<head>`
- store any inlined data URL there as a CSS variable
- use this variable in the `url()` definition currently being inlined
- save a map somewhere: `url => css variable`
- if the same URL should be inlined again, directly use the existing CSS variable
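The steps above could be sketched roughly as follows. The class name, the `--freeze-dry-N` naming scheme and the assumption of base64 data URLs (which contain no quotes) are all made up for illustration. One caveat, as far as I know: `url(var(--x))` is not substituted by browsers, so the variable has to hold the whole `url(...)` value, and `var()` only works in property values, not in `@font-face` descriptors or `@import`.

```typescript
// Hypothetical helper; not part of freeze-dry.
class DataUrlRegistry {
  // url => css variable name
  private readonly variables = new Map<string, string>();
  // css variable name => data URL
  private readonly dataUrls = new Map<string, string>();

  // Returns the CSS value to use in place of url(<originalUrl>).
  // A URL seen before reuses its existing variable.
  reference(originalUrl: string, dataUrl: string): string {
    let name = this.variables.get(originalUrl);
    if (name === undefined) {
      name = `--freeze-dry-${this.variables.size}`;
      this.variables.set(originalUrl, name);
      this.dataUrls.set(name, dataUrl);
    }
    return `var(${name})`;
  }

  // CSS text for the <style> block at the top of <head>.
  // Assumes data URLs without double quotes (e.g. base64-encoded).
  toStylesheet(): string {
    const declarations = Array.from(this.dataUrls.entries()).map(
      ([name, dataUrl]) => `  ${name}: url("${dataUrl}");`,
    );
    return `:root {\n${declarations.join("\n")}\n}`;
  }
}
```

A declaration such as `background-image: url(bg.png)` would then become `background-image: var(--freeze-dry-0)`, with the data URL itself stored only once.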
Could `<link>` elements to CSS files be replaced with `<style>`, and the CSS be included as plaintext, not base64 converted? Should save another 30% for CSS.
Another thing I came across is in a responsive image, where the `srcset` value was pointing to an image that did not exist. Therefore the website did return a 404 error page, but then this was inlined instead of being omitted which added some 2MBs to the dump. So checking the MIME type when loading image sources, and omitting non-images, could further help?
Thanks for the tips!
> Could `<link>` elements to CSS files be replaced with `<style>`, and the CSS be included as plaintext, not base64 converted? Should save another 30% for CSS.
The very first implementation of freeze-dry did convert linked stylesheets to `<style>` elements. I changed it to the current behaviour for two reasons:
- We’d have to ensure the stylesheet does not corrupt the HTML (see issue #17). This could probably be done though.
- I tried to make the resulting tree match its original as closely as possible. However there are several optimisations that would often be desirable, so this ‘as-close-as-possible’ goal should probably become optional, so people can choose depending on their use case.
> Another thing I came across is in a responsive image, where the `srcset` value was pointing to an image that did not exist. Therefore the website did return a 404 error page, but then this was inlined instead of being omitted which added some 2MBs to the dump. So checking the MIME type when loading image sources, and omitting non-images, could further help?
See issue #31. This should be easy to solve. We could e.g. check for `response.ok` in `Resource.fromLink` and throw if it is false. I don’t have time to test it now, but you would be most welcome to try it out if you like!
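A rough, untested-against-freeze-dry sketch of what such a check might look like, combined with the MIME type idea from the question. The function name and the `ResponseLike` type are hypothetical; the real change would live wherever `Resource.fromLink` handles the fetched response. It also assumes the server sends a `Content-Type` header, which is not guaranteed.

```typescript
// Minimal structural type so the sketch does not depend on DOM typings;
// a fetch() Response satisfies it.
interface ResponseLike {
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
}

// Throws if the response is an error (e.g. a 404 page) or is not an image.
function assertUsableImageResponse(response: ResponseLike, url: string): void {
  if (!response.ok) {
    throw new Error(`Fetching ${url} failed with status ${response.status}`);
  }
  // A missing header is treated as "not an image" in this sketch.
  const contentType = response.headers.get("Content-Type") ?? "";
  if (!contentType.toLowerCase().startsWith("image/")) {
    throw new Error(`Expected an image at ${url}, got "${contentType}"`);
  }
}
```

The caller could catch the error and simply omit that subresource, so an HTML error page never ends up inlined as an image.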
PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues.. ;)
Besides catching 404s, seems like setting `textContent` on style nodes does work, this should allow the Browser to catch anything invalid?
I will at some point try to set up everything so I can play around with the code. However, I have never used TypeScript before, so let's see when I have enough time to check that out.
> PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues.. ;)
I can imagine. I'm making an (internal) Firefox extension including Apache WebAnnotator and freezeDry (and probably I'll also use Mozilla's Readability to save a plaintext version) that talks to an internal tool to save the data.
The idea/plan is to use this whenever we have to research some info on the web that later goes into our database, to be able to document where this info came from. The frozen websites including the annotations will then be shown in an iframe within the tool.
> Besides catching 404s, seems like setting `textContent` on style nodes does work, this should allow the Browser to catch anything invalid?
Maybe? Pay attention to check escaping of `>` characters as `&gt;` etc. It could even happen that updating the DOM works fine, but serialising to HTML and parsing it again breaks.
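One case that, to my understanding of HTML parsing, would survive setting `textContent` but break on re-parsing is a stylesheet containing a literal `</style` (for instance inside a string or comment), since the parser ends the element there. A conservative check could look like this; the function name is hypothetical and this is only a sketch, not a complete safety test.

```typescript
// Hypothetical helper, not part of freeze-dry.
// Returns false if the CSS contains "</style" in any letter case, which would
// close the <style> element early once the serialised HTML is parsed again.
function canEmbedInStyleElement(css: string): boolean {
  return !/<\/style/i.test(css);
}
```

Stylesheets failing the check could keep the current `<link>` with data URL behaviour instead of being converted to `<style>`.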
Nice to hear your use case! Also good to hear another use case of Apache Annotator (you might have noticed my name a lot in there too..).