Demo: compress index to save network bandwidth
The demo site https://stork-search.net/ downloads a 1.7M index of the Federalist Papers uncompressed over the network.
To better evaluate the impact of adding Stork search to a static site, it would help to precompress that example index.
Brotli compression with default settings compresses the same index down to 322K.
Likewise for the 180K .wasm file.
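For reference, a pre-compression step could be as small as the sketch below. This is a minimal sketch assuming a Node-based build step; the file names (`federalist.st`, `stork.wasm`) are placeholders on my part, not the site's actual build pipeline.

```typescript
// Minimal sketch, assuming a Node build step; the file names below are
// placeholders, not the demo site's real paths.
import { brotliCompressSync } from 'node:zlib';
import { readFileSync, writeFileSync } from 'node:fs';

for (const file of ['federalist.st', 'stork.wasm']) {
  // Default Brotli settings, as in the numbers quoted above.
  writeFileSync(`${file}.br`, brotliCompressSync(readFileSync(file)));
}
```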
That's a solid plan. I briefly looked into different compression options, because I quickly realized that for whatever reason these index files are highly compressible. I'll close this issue when the CDN starts doing that compression!
Hey @jameslittle230, out of curiosity: have you considered using fst as the data structure for your index?
@ngirard I have not (hadn’t heard of it), and from a quick glance, I can’t tell what the benefit would be. Can you tell me more about how it might make Stork better?
@jameslittle230, TBH I haven't given it much thought. It just popped into my mind as I was skimming through your project's pages and read that your index files were on the large side.
Since fst can produce compact indices, I thought it could help, but I might very well be wrong!
@ngirard - that makes sense. I was taking another look at it last night and was trying to think about where it slots in — I have some ideas that I want to try out.
Thanks for letting me know about the library — much appreciated. :)
@jameslittle230, in any case I'm glad I introduced you to this nice crate.
And thank you for investing your time into this nice project of yours.
Cheers & take care!
Coming back to the compression aspect of this: it looks like I'll have to:
- Set up infrastructure to compress the files automatically on every deploy
- Use that infrastructure to upload uncompressed, gzipped, and brotli'd files to S3
- Write (or lift) a Lambda@Edge function that switches on the incoming `Accept-Encoding` header and rewrites the requested URI to send the correct file from the S3 bucket (a rough sketch of such a function follows this list)
- Test that requests with different `Accept-Encoding` headers are receiving different bits over the wire
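For what it's worth, an origin-request handler along those lines might look roughly like this. It's a sketch under assumptions of mine, not a description of the eventual deployment: it assumes pre-compressed siblings (`*.br`, `*.gz`) exist in the bucket and that their `Content-Encoding` metadata was set at upload time.

```typescript
// Rough sketch of an origin-request Lambda@Edge handler (illustrative only).
// Assumes pre-compressed siblings (foo.st.br, foo.st.gz) exist in the S3
// bucket with Content-Encoding metadata set at upload time.
import type { CloudFrontRequestEvent, CloudFrontRequest } from 'aws-lambda';

export const handler = async (
  event: CloudFrontRequestEvent
): Promise<CloudFrontRequest> => {
  const request = event.Records[0].cf.request;
  const acceptEncoding = request.headers['accept-encoding']?.[0]?.value ?? '';

  // Only rewrite the large static assets that were uploaded in several encodings.
  if (request.uri.endsWith('.st') || request.uri.endsWith('.wasm')) {
    if (acceptEncoding.includes('br')) {
      request.uri += '.br';
    } else if (acceptEncoding.includes('gzip')) {
      request.uri += '.gz';
    }
  }
  return request;
};
```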
CloudFront has native Brotli support. Is that not supported for your file types for any reason?
@monken - the WASM file and the index are not served with the MIME types in the "File types that CloudFront compresses" list.
The index size is concerning to me as I consider using this. The size of the first 20 Federalist Papers is only 241KB. If the index is 1.13MB, then it's over 4 times the size of the indexed data.
I have a static site for a book where the text data is 1.1MB broken up into 22 files. If integrating search would add about 5MB to the page load size, it seems prohibitive.
On a related note, why are you only searching the first 20 papers? Is it because Stork can't handle the whole set for some reason?
@jtbayly - It's a fair point! Over time, I hope to make improvements to the index file format to reduce index sizes.
The 20-file limit in the demo, though, is unrelated. To build the demo, I manually pulled each paper from the source and cleaned up the text by hand. I stopped doing that once I reached 20 because it ended up taking more of my time than it was worth.
It might be a better idea to add fflate support (a small and fast gzip/zip/deflate compressor and decompressor) so no extra server configuration is needed, since Stork seems to be aimed mostly at static sites.
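As a sketch of what that could look like on the client side (the `downloadIndex` helper and the `.st.gz` file name are made up for illustration, not part of Stork's actual API):

```typescript
// Hypothetical client-side decompression with fflate; not Stork's actual API.
// The compressed index is served as an ordinary static file, so no CDN or
// server configuration is required.
import { gunzipSync } from 'fflate';

async function downloadIndex(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  const compressed = new Uint8Array(await response.arrayBuffer());
  return gunzipSync(compressed); // raw index bytes, ready to hand to the WASM module
}

// Usage (file name is illustrative):
// const index = await downloadIndex('/federalist.st.gz');
```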