Only grab necessary subresources

Open Treora opened this issue 6 years ago • 4 comments

Currently, we inline all resolutions listed in an <img>'s srcset, all <audio> and <video> sources, all stylesheets, etcetera. This makes snapshots huge. The upside is that the snapshot will be as rich as the original, and more likely to work and look as intended in various browsers and screen resolutions. Depending on the application, one or the other factor may be more important, so it would be nice to make configurable how much we grab. Some preliminary thoughts on this:

  • One reasonable desire is to grab only things that are currently in use (if this can be tested for). This could help a lot with speeding freeze-dry up, as those things may be available from cache.

  • For images with multiple resolutions, we could read element.currentSrc, and only grab that one. And/or perhaps get the one with highest resolution.

  • For audio and video, the sources are usually different file formats; currentSrc seems a reasonable choice again, or some prewired preference to pick a widely supported and/or well compressed format (again a possible trade-off).

  • For stylesheets, we may filter by media queries, either in a media attribute on a <link> (to omit the whole stylesheet) or in @media at-rules inside stylesheets (to omit the subresources they affect). The next question is then which media queries to filter for: type (screen/print), window size; possibly again only take what is currently active.

  • For fonts, we could take only the ones currently used/loaded (how? the status attributes of fonts in document.fonts?). And we could hard-code a preference for some well compressed and/or widely supported file format.
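
A minimal sketch of the srcset idea above: prefer the candidate that is currently in use, otherwise fall back to the highest resolution. The names `parseSrcset` and `pickSubresource` are hypothetical and not part of freeze-dry, the parser is deliberately simplified (it splits on commas, so URLs containing commas are not handled), and it assumes `currentSrc` and the srcset URLs have already been resolved to the same absolute form.

```typescript
interface SrcsetCandidate {
  url: string;
  // Numeric value of the descriptor: width for "480w", density for "2x" (1 if absent).
  size: number;
}

// Simplified srcset parser (assumption: no commas inside the URLs themselves).
function parseSrcset(srcset: string): SrcsetCandidate[] {
  return srcset
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const [url, descriptor = "1x"] = part.split(/\s+/);
      return { url, size: parseFloat(descriptor) };
    });
}

// Returns the URL to keep: the one in use if it is listed, else the largest candidate.
function pickSubresource(srcset: string, currentSrc?: string): string | undefined {
  let best: SrcsetCandidate | undefined;
  for (const candidate of parseSrcset(srcset)) {
    if (candidate.url === currentSrc) return candidate.url;
    if (best === undefined || candidate.size > best.size) best = candidate;
  }
  return best ? best.url : undefined;
}
```

In a browser, `currentSrc` would come from `element.currentSrc`; whether to prefer it over the largest candidate is exactly the trade-off that would need to be configurable.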

Treora avatar Sep 13 '18 19:09 Treora

Another optimization could be done for url()s in CSS:

  • include a new <style> block at the top of <head>
  • store any inlined dataurl there as a CSS variable
  • use this variable in the url() definition currently being inlined
  • save a map somewhere - url => css variable
  • if the same URL should be inlined again, directly use the existing CSS variable
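
The steps above could be sketched roughly as follows. Note that `var()` cannot appear inside `url()`, so this sketch assumes the custom property holds the whole `url("...")` value and that the original `url(...)` token is replaced by `var(--name)`. As far as I know `var()` is also only valid in property values, so this would not help for `@font-face` or `@import`. The helper name `createUrlVariableRegistry` and the `--inlined-N` naming scheme are made up for illustration.

```typescript
interface UrlVariableRegistry {
  // Returns the var() reference to use in place of the url() token.
  reference(originalUrl: string, dataUrl: string): string;
  // Returns the CSS to put in the <style> block at the top of <head>.
  toCss(): string;
}

function createUrlVariableRegistry(): UrlVariableRegistry {
  const names = new Map<string, string>(); // original URL => CSS variable name
  const declarations: string[] = [];

  return {
    reference(originalUrl, dataUrl) {
      let name = names.get(originalUrl);
      if (name === undefined) {
        // First time this URL is inlined: declare a new variable for it.
        name = `--inlined-${names.size + 1}`;
        names.set(originalUrl, name);
        // Assumes the data URL contains no double quotes (true for base64 data URLs).
        declarations.push(`  ${name}: url("${dataUrl}");`);
      }
      return `var(${name})`;
    },
    toCss() {
      return `:root {\n${declarations.join("\n")}\n}`;
    },
  };
}
```

A rule such as `background: url(bg.png)` would then become `background: var(--inlined-1)`, and the data URL is stored only once however often `bg.png` is referenced.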

Could <link> elements to CSS files be replaced with <style>, and the CSS be included as plaintext, not base64 converted? Should save another 30% for CSS.
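
For reference, a quick sketch of where that figure comes from: base64 encodes every 3 bytes as 4 characters, so the encoded form is about a third larger than the input (put differently, plaintext is about a quarter smaller than its base64 form). The function names are just for illustration.

```typescript
// Length of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// Relative size increase caused by base64-encoding, e.g. 0.33 for about 33%.
function base64Overhead(byteLength: number): number {
  return base64Length(byteLength) / byteLength - 1;
}
```

This ignores any escaping the plaintext CSS might need inside a `<style>` element, which is normally negligible.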

Another thing I came across is a responsive image whose srcset value pointed to an image that did not exist. The website returned a 404 error page, which was then inlined instead of being omitted, adding some 2 MB to the dump. So checking the MIME type when loading image sources, and omitting non-images, could further help?

JYone3A avatar Feb 01 '23 13:02 JYone3A

Thanks for the tips!

> Could <link> elements to CSS files be replaced with <style>, and the CSS be included as plaintext, not base64 converted? Should save another 30% for CSS.

The very first implementation of freeze-dry did convert linked stylesheets to <style> elements. I changed it to the current behaviour for two reasons:

  1. We’d have to ensure the stylesheet does not corrupt the HTML (see issue #17). This could probably be done though.
  2. I tried to make the resulting tree match its original as closely as possible. However there are several optimisations that would often be desirable, so this ‘as-close-as-possible’ goal should probably become optional, so people can choose depending on their use case.

> Another thing I came across is a responsive image whose srcset value pointed to an image that did not exist. The website returned a 404 error page, which was then inlined instead of being omitted, adding some 2 MB to the dump. So checking the MIME type when loading image sources, and omitting non-images, could further help?

See issue #31. This should be easy to solve. We could e.g. check for response.ok in Resource.fromLink and throw if it is false. I don’t have time to test it now, but you’d be most welcome to try it out if you like!
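
A rough, untested-against-freeze-dry sketch of such a check, combining the response.ok test with the MIME type check suggested above. `checkImageResponse` is a hypothetical standalone helper, not the actual code of Resource.fromLink, and `ResponseLike` is a structural subset of the Fetch API's Response so the sketch does not depend on a browser.

```typescript
// Minimal subset of the Fetch API's Response that the check needs.
interface ResponseLike {
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
}

// Throws if the response is an HTTP error or does not look like an image.
function checkImageResponse(response: ResponseLike, url: string): void {
  if (!response.ok) {
    throw new Error(`Fetching ${url} failed with status ${response.status}`);
  }
  const contentType = (response.headers.get("Content-Type") || "").trim().toLowerCase();
  if (contentType.indexOf("image/") !== 0) {
    throw new Error(`Expected an image at ${url} but got "${contentType}"`);
  }
}
```

The response.ok test alone already covers a real 404; the Content-Type test would additionally catch servers that answer with status 200 and an HTML error page.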

PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues.. ;)

Treora avatar Feb 02 '23 15:02 Treora

Besides catching 404s, seems like setting textContent on style nodes does work, this should allow the Browser to catch anything invalid?

I will at some point try to set everything up so I can play around with the code. However, I have never used TypeScript before, so let's see when I have enough time to check that out.

> PS I’m also always curious to know what people use freeze-dry for. It’s a little sad to only hear about its issues.. ;)

I can imagine. I'm making an (internal) Firefox extension including Apache WebAnnotator and freezeDry (and probably I'll also use Mozilla's Readability to save a plaintext version) that talks to an internal tool to save the data.

The idea/plan is to use this whenever we have to research some info on the web that later goes into our database, to be able to document where this info came from. The frozen websites including the annotations will then be shown in an iframe within the tool.

JYone3A avatar Feb 07 '23 15:02 JYone3A

> Besides catching 404s, seems like setting textContent on style nodes does work, this should allow the Browser to catch anything invalid?

Maybe? Pay attention to check escaping of > characters as &gt; etc. It could even happen that updating the DOM works fine, but serialising to HTML and parsing it again breaks.
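
To illustrate the serialise-and-reparse problem: in HTML, the contents of a <style> element are raw text, so a literal `</style` inside the CSS (for example in a string) would end the element early once the document is parsed again. One possible mitigation, sketched below under the assumption that the sequence only occurs where a CSS escape is allowed (strings, URLs), is to insert a backslash, since `\/` in CSS is just an escaped `/`. The function name is hypothetical.

```typescript
// Makes CSS text safe to place inside an HTML <style> element by breaking up
// any "</style" sequence with a CSS escape. Other characters, such as the ">"
// of a child selector, are left untouched on purpose.
function escapeCssForStyleElement(css: string): string {
  return css.replace(/<\/(style)/gi, "<\\/$1");
}
```

This only addresses HTML serialisation; if the document were serialised as XML instead, `>` and `&` escaping would be a separate concern.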

Nice to hear your use case! Also good to hear another use case of Apache Annotator (you might have noticed my name a lot in there too..).

Treora avatar Feb 08 '23 14:02 Treora