
Define size of all storage actions

annevk opened this issue on Jul 10, 2020 · 5 comments

In order to give developers a more consistent experience across browsers, while allowing browsers to compress, deduplicate, and otherwise optimize the stored data, we should standardize the upper bound for each storage action and have all browsers enforce that.

E.g., the size of localStorage[key] = value could be (key's code unit length + value's code unit length) × 2 + 16 bytes of safety padding or some such. (I did not put a lot of thought into this. If we go down this path we'd need to do that.)
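
As a rough illustration only (the function name and constants below are placeholders following the formula above, not a proposal), that cost could be expressed as:

```ts
// Hypothetical quota charge for `localStorage[key] = value`, following the
// rough formula above: 2 bytes per UTF-16 code unit of key and value,
// plus a fixed 16 bytes of safety padding. All constants are placeholders.
const SAFETY_PADDING_BYTES = 16;

function localStorageSetCost(key: string, value: string): number {
  // String.prototype.length counts UTF-16 code units, matching
  // "code unit length" in the formula above.
  return (key.length + value.length) * 2 + SAFETY_PADDING_BYTES;
}

// Example: localStorage["theme"] = "dark" would be charged
// (5 + 4) * 2 + 16 = 34 bytes against the origin's quota.
```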

(See 6 in https://github.com/whatwg/storage/issues/95#issuecomment-656555686 and reply for context.)

annevk commented on Jul 10, 2020

This seems desirable and has indeed come up before, specifically in the context of allowing structured-serialized storage of data on things like ServiceWorker registrations and related data (e.g. Notification.data), where it would be desirable to place an upper bound on storage, but doing so is an interop nightmare unless this issue is addressed.

I believe this would require the serialization steps for [Serializable] to also produce a size/upper-bound value?

It seems like the most complex issues are:

  1. Blob/File and any similarly immutable abstractions which allow implementations like IndexedDB to store a single copy of the data on disk. Firefox only stores a single copy of a given Blob/File (based on object identity, independent of contents). I presume the only course of action is either to standardize this or to tally each time the blob is used in a structured serialization (which will de-duplicate internally via its "memory"); see the sketch after this list. If standardized, interesting and terrifying new possibilities are raised, such as the BlobStore being its own storage endpoint which can then be used by Notification.data and even ServiceWorker's Cache API storage.
  2. Compression. It would be unfortunate for implementations to be able to implement CPU/power/disk-efficient native storage of data but need to charge a high quota cost, resulting in content performing less efficient compression in JS/WASM in order to be charged a lower quota cost while actually using more disk space. Presumably the answer is Compression Streams? But this is still awkward because, for example, Firefox currently uses Snappy (for Cache API storage) and wants to use LZ4 (for Cache API storage and IndexedDB), neither of which is yet specified, and it would be arguably silly to run gzip against data just for the purposes of calculating a more generous quota charge while actually storing the data using LZ4.
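
As a rough sketch of the two accounting options from item 1 (all names and constants below are hypothetical, not an API proposal), per-use versus per-identity charging could look like:

```ts
// Hypothetical per-reference overhead for a serialized Blob; a placeholder.
const BLOB_OVERHEAD = 64;

// Per-use accounting: every reference in a structured serialization is
// charged, even if the implementation stores the bytes only once.
function chargePerUse(blobs: Blob[]): number {
  return blobs.reduce((sum, b) => sum + BLOB_OVERHEAD + b.size, 0);
}

// Per-identity accounting: a given Blob object is only charged once,
// mirroring a single-copy-on-disk implementation keyed by object identity.
function chargePerIdentity(blobs: Blob[]): number {
  const seen = new Set<Blob>();
  let sum = 0;
  for (const b of blobs) {
    if (!seen.has(b)) {
      seen.add(b);
      sum += BLOB_OVERHEAD + b.size;
    }
  }
  return sum;
}
```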

asutherland commented on Jul 10, 2020

Thank you very much for opening a specific issue for this topic!

Reiterating here for clarity -- Chrome is supportive of this effort to come up with an abstract cost model for storage. We'd be willing to take on the (quite non-trivial) implementation costs if the model gains cross-browser acceptance.

I also really like that @asutherland brought up some of the complex issues early on. I'd be tempted to follow the solutions of other systems I'm aware of.

  1. Blobs: Charge a separate copy per item. I claim this approach is more intuitive to users -- you're charged for what you write, with decisions made locally. Implementers still get the benefits of content de-duplication as an operational cost reduction. I think this approach would also make the proposal more palatable, because we'd be avoiding asking browsers to implement content de-duplication to be compliant.

  2. Compression: Charge for uncompressed data. Same reasoning as above -- it's more intuitive to be charged for what you write. Also, unless we mandate that each object is compressed individually, compression ratios depend on adjacent data, so I think we'd end up with a lot of constraints around physical data layout. I'd strongly prefer that specs don't get into this business :smile:

On a brighter note, the zstd benchmarks suggest that the algorithms we'd consider have ratios within 2x of each other (and below 3x of uncompressed) for "typical" data. I claim this is well within the precision margin for the cost model we'd be building up here.

Along the same lines, I hope that we can avoid having apps play games (like manual compression) by being reasonably generous with quota. Ideally, apps without bugs should not run into quota problems.

pwnall commented on Jul 22, 2020

I found some notes from when I tried to sketch a storage cost model for IndexedDB. This was in 2018, and I knew a lot less about the implementation back then. So, the numbers are probably bad, but at least it's a list of things to consider.

Object cost:

  • primitives (number, Date, null, true, false): 10 -- accommodates (tag + 8 bytes or tag + <= 9 bytes of varint)
  • string: 8 + 2 * string length
  • object: 8 + sum of keys and values
  • array: 16 + sum of elements
  • native arrays: 16 + buffer length
  • ImageData: 32 + the cost of ImageData.data as a native array
  • Blob: 64 + cost of MIME type as string + length
  • File: Blob + cost of filename as string

I might have missed some other object. The idea is to assign a cost based on a straightforward representation for each clonable. The cost doesn't have to be exact, because we expect implementations to have their own overhead.
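
As an illustrative sketch only, the cost table above could be expressed roughly as follows (the function name is made up, cycles and several clonable types are ignored, and the constants are the rough numbers from the table, not spec values):

```ts
// Illustrative cost model based on the table above. Constants come straight
// from the sketch; unsupported clonables and cyclic structures are ignored.
function objectCost(value: unknown): number {
  if (
    value === null || value === true || value === false ||
    typeof value === "number" || value instanceof Date
  ) {
    return 10; // primitives: tag + 8 bytes, or tag + <= 9 bytes of varint
  }
  if (typeof value === "string") {
    return 8 + 2 * value.length;
  }
  if (value instanceof ArrayBuffer) {
    return 16 + value.byteLength; // "native arrays": 16 + buffer length
  }
  if (ArrayBuffer.isView(value)) {
    return 16 + value.byteLength;
  }
  if (typeof ImageData !== "undefined" && value instanceof ImageData) {
    return 32 + objectCost(value.data); // ImageData.data as a native array
  }
  if (value instanceof File) {
    // File: Blob cost + cost of filename as string
    return 64 + objectCost(value.type) + value.size + objectCost(value.name);
  }
  if (value instanceof Blob) {
    return 64 + objectCost(value.type) + value.size; // 64 + MIME type + length
  }
  if (Array.isArray(value)) {
    return 16 + value.reduce((sum: number, el) => sum + objectCost(el), 0);
  }
  if (typeof value === "object") {
    return 8 + Object.entries(value as object).reduce(
      (sum, [k, v]) => sum + objectCost(k) + objectCost(v),
      0,
    );
  }
  return 0; // not covered by the sketch (undefined, Map, Set, RegExp, ...)
}
```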

IndexedDB transaction costs (get refunded when the transaction completes):

  • 32 per open store and index in a transaction; write transactions open all indexes in their stores; versionchange transactions open all stores and indexes
  • write: 64 + inputs (key + value) + sum over indexes touched (16 + index key + primary key)
  • delete: like a write, but with zero value cost
  • store creation: 64 + store name and key path as strings
  • index creation: 64 + index name and key path as strings
  • store/index renames: same as creation
  • store/index deletion: 64; deleting a store implies deleting all its indexes

This isn't a complete list. I hope it's a good starting point if someone is itching to start an explainer :smile:
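
As a rough sketch of the write/delete lines above (the input shapes are assumptions, and objectCost is the hypothetical function from the earlier sketch, not anything defined by IndexedDB):

```ts
// Illustrative cost of a single IndexedDB write, following the sketch above:
// 64 + cost(key) + cost(value) + sum over touched indexes of
// (16 + cost(index key) + cost(primary key)). All constants are the rough
// numbers from the sketch, not spec values.
interface IndexEntry {
  indexKey: unknown;
  primaryKey: unknown;
}

declare function objectCost(value: unknown): number; // from the earlier sketch

function writeCost(
  key: unknown,
  value: unknown,
  indexEntries: IndexEntry[],
): number {
  let cost = 64 + objectCost(key) + objectCost(value);
  for (const entry of indexEntries) {
    cost += 16 + objectCost(entry.indexKey) + objectCost(entry.primaryKey);
  }
  return cost;
}

// A delete would be charged like a write, but with zero value cost.
function deleteCost(key: unknown, indexEntries: IndexEntry[]): number {
  return writeCost(key, undefined, indexEntries);
}
```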

pwnall commented on Jul 22, 2020

@pwnall Your simplifying proposal in https://github.com/whatwg/storage/issues/110#issuecomment-662493325 sounds good to me. It's also very consistent with reality: Mozilla's Servo project is an example of bringing up a browser engine from (roughly) scratch, and they have found implementing IndexedDB non-trivial, so further complicating the standard and raising the bar for building a compliant browser engine would not be a win for the web.

asutherland commented on Jul 22, 2020

See also: https://github.com/whatwg/html/issues/4914.

annevk commented on Oct 26, 2020