Implement efficient object storage streaming handler

Open chocolatkey opened this issue 1 year ago • 0 comments

While the toolkit can stream files from local file systems, in many cases users of the toolkit will want to stream publications from an object storage provider, such as Amazon S3 or GCP Cloud Storage. This may seem trivial to implement at first glance ("just hook up S3 to the streamer!"), but doing so efficiently is difficult because of the way ZIP-based files (EPUB, CBZ, etc.) have to be read.

There is already an optional "minimized read" utility in the toolkit's ZIP archive reader, but it only works well when paired with lower-level optimizations in how the ZIP itself is read. The diagram below shows how many reads are needed just to generate a WebPub manifest when opening the Moby Dick EPUB file. For local filesystems this is perfectly fine and efficient, but when the file is located across the web, each extra read becomes a request that adds latency. Without optimizations, that latency has a big impact on the performance of whatever software a user of the go-toolkit is writing, and the requests themselves add cost, since many object storage providers charge per request.

[Figure: Reading moby-dick.epub to generate a WebPub Manifest]
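To illustrate the kind of lower-level optimization meant here, the following is a minimal sketch, not the toolkit's "minimized read" utility and not the cloud storage logic mentioned in this issue. It assumes a remote object exposed as an `io.ReaderAt` where every `ReadAt` call costs one ranged request, and it relies on the fact that a ZIP reader starts from the central directory at the end of the file. All type and function names (`countingReaderAt`, `tailCacheReaderAt`, `buildZip`, `openAndCount`) and the 64 KiB tail size are hypothetical choices made for the example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// countingReaderAt stands in for a remote object (hypothetical): each
// ReadAt call models one ranged GET against the storage provider.
type countingReaderAt struct {
	r        io.ReaderAt
	requests int
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	c.requests++
	return c.r.ReadAt(p, off)
}

// tailCacheReaderAt fetches the last tailSize bytes once and serves any
// read that falls entirely inside that range from memory.
type tailCacheReaderAt struct {
	src       io.ReaderAt
	tail      []byte
	tailStart int64
}

func newTailCacheReaderAt(src io.ReaderAt, size, tailSize int64) (*tailCacheReaderAt, error) {
	if tailSize > size {
		tailSize = size
	}
	tail := make([]byte, tailSize)
	n, err := src.ReadAt(tail, size-tailSize)
	// A full read that also reports io.EOF is allowed by io.ReaderAt.
	if err != nil && !(err == io.EOF && int64(n) == tailSize) {
		return nil, err
	}
	return &tailCacheReaderAt{src: src, tail: tail, tailStart: size - tailSize}, nil
}

func (t *tailCacheReaderAt) ReadAt(p []byte, off int64) (int, error) {
	end := t.tailStart + int64(len(t.tail))
	if off >= t.tailStart && off+int64(len(p)) <= end {
		return copy(p, t.tail[off-t.tailStart:]), nil
	}
	return t.src.ReadAt(p, off) // outside the cached tail: one more request
}

// buildZip creates an in-memory archive with uncompressed 1 KiB entries.
func buildZip(entries int) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for i := 0; i < entries; i++ {
		name := fmt.Sprintf("chapter-%03d.xhtml", i)
		f, err := w.CreateHeader(&zip.FileHeader{Name: name, Method: zip.Store})
		if err != nil {
			return nil, err
		}
		if _, err := f.Write(bytes.Repeat([]byte("x"), 1024)); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// openAndCount lists the archive with zip.NewReader and returns the number
// of simulated requests and the number of entries. tailSize 0 disables the cache.
func openAndCount(data []byte, tailSize int64) (int, int, error) {
	remote := &countingReaderAt{r: bytes.NewReader(data)}
	size := int64(len(data))
	var src io.ReaderAt = remote
	if tailSize > 0 {
		cached, err := newTailCacheReaderAt(remote, size, tailSize)
		if err != nil {
			return 0, 0, err
		}
		src = cached
	}
	zr, err := zip.NewReader(src, size)
	if err != nil {
		return 0, 0, err
	}
	return remote.requests, len(zr.File), nil
}

func main() {
	data, err := buildZip(200)
	if err != nil {
		panic(err)
	}
	direct, _, err := openAndCount(data, 0)
	if err != nil {
		panic(err)
	}
	cached, _, err := openAndCount(data, 64<<10)
	if err != nil {
		panic(err)
	}
	fmt.Printf("requests without tail cache: %d, with tail cache: %d\n", direct, cached)
}
```

The direct count depends on how the Go version's `archive/zip` buffers the central directory, so no fixed output is shown. With the tail cached, listing the archive needs only the single prefetch request, as long as the central directory fits in the tail. Reading individual entries afterwards would still go to the backing store, which is where further batching or read-ahead would come in.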

I plan on porting my cloud storage reading logic to the go-toolkit to address this issue.

chocolatkey · Sep 07 '24 23:09