Demo: compress index to save network bandwidth

Open cloudspeech opened this issue 4 years ago • 12 comments

The demo site https://stork-search.net/ downloads a 1.7M index of the Federalist Papers uncompressed over the network.

To better evaluate the impact of adding stork search to a static site, it would help to precompress said example index.

Brotli compression with default settings compresses the same index down to 322K.

Likewise for the 180K .wasm file.
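To get a rough feel for how much a given index shrinks before touching any server configuration, a small script along these lines could be run locally. This is only a sketch: it uses gzip from the Python standard library rather than Brotli (Brotli needs a third-party package), the helper names `precompress` and `report` are made up for illustration, and the sample data is repetitive stand-in text, not a real Stork index.

```python
import gzip
import tempfile
from pathlib import Path


def precompress(path: Path, level: int = 9) -> Path:
    """Write a gzip copy of `path` next to it and return the new path."""
    out = path.with_name(path.name + ".gz")
    out.write_bytes(gzip.compress(path.read_bytes(), compresslevel=level))
    return out


def report(path: Path) -> str:
    """Compress `path` and describe the size before and after."""
    compressed = precompress(path)
    before = path.stat().st_size
    after = compressed.stat().st_size
    return f"{path.name}: {before} -> {after} bytes ({after / before:.0%})"


if __name__ == "__main__":
    # Stand-in data: repetitive text, NOT a real Stork index.
    # The file name "sample.st" is hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.st"
        sample.write_bytes(b"the federalist papers " * 5000)
        print(report(sample))
```

Pointing `report` at the real index and .wasm files would show the gzip savings; Brotli typically compresses somewhat further than gzip.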

cloudspeech avatar Mar 07 '20 16:03 cloudspeech

That's a solid plan. I briefly looked into different compression options, because I quickly realized that for whatever reason these index files are highly compressible. I'll close this issue when the CDN starts doing that compression!

jameslittle230 avatar Mar 07 '20 20:03 jameslittle230

Hey @jameslittle230, out of curiosity: have you considered using fst as the data structure for your index?

ngirard avatar Apr 06 '20 17:04 ngirard

@ngirard I have not (hadn’t heard of it), and from a quick glance, I can’t tell what the benefit would be. Can you tell me more about how it might make Stork better?

jameslittle230 avatar Apr 06 '20 18:04 jameslittle230

@jameslittle230, TBH I haven't given it much thought. It just popped into my mind as I was skimming through your project's pages and read that your index files were large. Since fst can produce compact indices, I thought it could help, but I might very well be wrong!

ngirard avatar Apr 06 '20 18:04 ngirard

@ngirard - that makes sense. I was taking another look at it last night and was trying to think about where it slots in — I have some ideas that I want to try out.

Thanks for letting me know about the library — much appreciated. :)

jameslittle230 avatar Apr 07 '20 16:04 jameslittle230

@jameslittle230, in any case I'm glad I introduced you to this nice crate.

And thank you for investing your time into this nice project of yours.

Cheers & take care!

ngirard avatar Apr 07 '20 16:04 ngirard

Coming back to the compression aspect of this: it looks like I'll have to:

  1. Set up infrastructure to compress the files automatically on every deploy
  2. Use that infrastructure to upload uncompressed, gzipped, and brotli'd files to S3
  3. Write (or lift) a Lambda@Edge function that switches on the incoming Accept-Encoding header and rewrites the requested URI so that the correct file is served from the S3 bucket
  4. Test that different requests with different Accept-Encoding headers are receiving different bits over the wire
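Step 3 above could look roughly like the following. This is a minimal sketch under several assumptions, not the function the project ended up using: it assumes the compressed variants are stored next to the originals with `.br` and `.gz` suffixes, that only `.st` and `.wasm` paths should be rewritten, and that the function runs as a CloudFront request trigger. The names `pick_encoding`, `SUFFIXES` and `COMPRESSIBLE` are hypothetical.

```python
# Assumed layout: "file.st", "file.st.gz" and "file.st.br" all exist in the bucket.
SUFFIXES = {"br": ".br", "gzip": ".gz"}
# Assumed set of paths worth rewriting (index and WebAssembly files).
COMPRESSIBLE = (".st", ".wasm")


def pick_encoding(accept_encoding: str) -> str:
    """Return "br", "gzip", or "" (serve the uncompressed file)."""
    accepted = set()
    for part in accept_encoding.split(","):
        token, _, params = part.partition(";")
        token = token.strip().lower()
        if not token:
            continue
        quality = 1.0
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # q=0 means the client explicitly refuses this encoding.
        if quality > 0:
            accepted.add(token)
    # Prefer Brotli, then gzip, then the uncompressed original.
    for encoding in ("br", "gzip"):
        if encoding in accepted:
            return encoding
    return ""


def handler(event, context):
    """Rewrite the request URI to the best precompressed variant."""
    request = event["Records"][0]["cf"]["request"]
    values = request.get("headers", {}).get("accept-encoding", [])
    header = ", ".join(v["value"] for v in values)
    encoding = pick_encoding(header)
    if encoding and request["uri"].endswith(COMPRESSIBLE):
        request["uri"] += SUFFIXES[encoding]
    return request
```

For the browser to decode the response, the stored `.br` and `.gz` objects would also need a matching Content-Encoding header. Step 4 then amounts to requesting the same URL with different Accept-Encoding values and checking that the response sizes differ.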

jameslittle230 avatar Apr 21 '20 03:04 jameslittle230

CloudFront has native Brotli support. Is that not supported for your file types for any reason?

monken avatar Feb 21 '21 14:02 monken

@monken - the WASM file and the index are not served with MIME types that appear in CloudFront's "File types that CloudFront compresses" list.

jameslittle230 avatar Feb 21 '21 19:02 jameslittle230

The index size is concerning to me as I consider using this. The size of the first 20 Federalist Papers is only 241KB. If the index is 1.13MB then the index is over 4 times the size of the indexed data.

I have a static site for a book where the text data is 1.1MB broken up into 22 files. If integrating search would add about 5MB to the page load size, it seems prohibitive.
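The arithmetic behind that estimate, worked through as a quick sketch. It treats 1 MB as 1000 KB and assumes the index-to-text ratio of the demo carries over to other content, which is not guaranteed. The last step applies the Brotli ratio quoted at the top of the thread (322K out of 1.7M), which is a further assumption since that figure was measured on a different version of the index.

```python
def index_ratio(index_kb: float, text_kb: float) -> float:
    """How many times larger the index is than the text it indexes."""
    return index_kb / text_kb


def projected_index_mb(text_mb: float, ratio: float) -> float:
    """Estimated index size if the same ratio holds for other text."""
    return text_mb * ratio


# Figures quoted in this thread.
ratio = index_ratio(index_kb=1130, text_kb=241)          # demo index vs. 20 papers
uncompressed = projected_index_mb(text_mb=1.1, ratio=ratio)
brotli_ratio = 322 / 1700                                 # from the opening comment
compressed = uncompressed * brotli_ratio                  # assumes the ratio carries over

print(f"index is {ratio:.2f}x the text")
print(f"projected index: {uncompressed:.1f} MB uncompressed")
print(f"projected index: {compressed:.2f} MB if Brotli does as well as on the demo")
```

So the uncompressed estimate is a little over 5 MB, while the figure that actually crosses the network could be closer to 1 MB if precompression is in place.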

On a related note, why are you only searching the first 20 papers? Is it because Stork can't handle the whole thing for some reason?

jtbayly avatar Apr 02 '21 17:04 jtbayly

@jtbayly - It's a fair point! Over time, I hope to be able to make improvements to the index file format to reduce index sizes.

The 20-file limit in the demo, though, is unrelated. To build the demo, I manually pulled each paper from the source and cleaned up the text by hand. I stopped doing that once I reached 20 because it ended up taking more of my time than it was worth.

jameslittle230 avatar Apr 03 '21 01:04 jameslittle230

It might be a better idea to add fflate support (small and fast gzip/zip/deflate compressor and decompressor) so no extra server configuration is needed, since stork seems to be mostly for static sites.

easrng avatar Jan 12 '22 01:01 easrng