storage icon indicating copy to clipboard operation
storage copied to clipboard

Proposal: Add detailed usage breakdown in estimate()

Open inexorabletash opened this issue 6 years ago • 16 comments

We've had a frequent request from users of estimate() to provide an per storage type breakdown estimation. For example, something like:

{
   quota: 440922000000,
   usage: 27300000,
   details: {
     indexeddb: 676000,
     cacheapi: 26500000,
     serviceworker: 52800
   }
}

This stems from web apps that have supported for years and/or where multiple apps are present within an origin makes reasoning about what is using up storage difficult. Also exacerbating the problem:

  • Storage APIs don't simply store bytes. There's encoding, metadata, compression, padding, and so on.
  • Not all APIs allow enumeration (e.g. https://github.com/w3c/IndexedDB/issues/31)

Lots of details to figure out, but:

  • Numbers would be estimates.
  • No need for everything to add up to the "usage" value. There should be an understanding that there will overhead/metadata in the system.
  • Would add hooks to this spec to allow other storage specs to define the key.
    • You could imagine sharing that with a more granular Clear-Site-Data header in the future.
  • In Chrome we'd want to surface non-standard/deprecated storage type usage (specifically, file system, websql, appcache), so the spec shouldn't specifically preclude that.

In Chrome we have this plumbed through internally for our DevTools implementation:

image

Thoughts?

inexorabletash avatar Jun 01 '18 20:06 inexorabletash

Would this make it easier to determine the size of opaque responses?

This would also benefit a lot from doing #18.

annevk avatar Jun 02 '18 05:06 annevk

Would this make it easier to determine the size of opaque responses?

I don't believe so; an implementation would need to ensure the padded size was included in whatever storage type was holding them.

This would also benefit a lot from doing #18.

Yep.

inexorabletash avatar Jun 02 '18 06:06 inexorabletash

Just to get it out of my head, the IDL would be:

dictionary StorageEstimate {
  unsigned long long usage;
  unsigned long long quota;
  record<DOMString, unsigned long long> details;
};

(Unless we invented partial enums when I wasn't looking)

inexorabletash avatar Jun 19 '18 22:06 inexorabletash

It seems we could just have a dictionary there as well. And there's partial dictionaries if we wanted to avoid a central registry.

annevk avatar Jun 20 '18 04:06 annevk

Hey @annevk - I work with @inexorabletash on Chrome. We have google properties that are very interested in this, we're ready to move forward with implementation. Here's our current thinking around IDL:

dictionary StorageEstimate {
  unsigned long long usage;
  unsigned long long quota;
  StorageUsageDetails usageDetails;
};

dictionary StorageUsageDetails {
  unsigned long long indexedDB;
  unsigned long long caches;
  unsigned long long serviceWorkerRegistrations;
  unsigned long long other;
};

Would you be open to a PR for the spec? We're hoping to be able to land this experimentally in Chrome soon with tentative web platform tests.

Outstanding questions:

  1. Do we want to include the other field in the UsageDetails? In either case we have unresolved, implicit ambiguity.
    • If we do include this, it would leave developers questioning what exactly usageDetails.other represents.
    • If we exclude the property, developers are left with a different form of ambiguity: The sum of each value in the usage details could potentially be less than the top level usage.
    • In either case, we would need to clearly document to resolve said ambiguity.
  2. What are the naming conventions?
    • Should these adhere to conventions by using pascalCase, or should they refer to the spec short names? Please share any other ideas you may prefer to the two listed. Note: the proposed keys match the property names used to access the corresponding APIs.
  3. As noted up-thread, future specs could use a partial dictionary to extend the set of members. Do we want to define this initial set of members in the storage spec itself or push that to the other specs (Indexed DB, Service Workers) that are defining things?

Is there anything else we should resolve before moving forward with an experimental implementation in Chrome? Have we heard about any interest in the overall API from Microsoft or Apple (i.e. are there contacts we should cc?)

Please note, Chrome intends on reporting applicationCache and some other obsolete and/or non-standard storage systems like fileSystem and webSQL.

jarryd999 avatar Nov 12 '18 18:11 jarryd999

@aliams and @hober might have thoughts for Microsoft and Apple, respectively.

@asutherland should maybe have a second look for Mozilla.

My main concern with this is opaque responses. I'm worried this'll make it easier to determine their size. [@inexorabletash did mention above that he thinks it's unlikely this'll worsen.] (I was actually wondering if we could make the origin opt-in to requiring CORS and only then give them estimates.)

Other than that, I guess I'd opt to exclude other, don't feel strongly on naming (but some investigation into existing identifiers probably make sense), and defining it all inline probably makes sense. Ideally this would follow from primitives the other specifications use to define their storage.

I'm also not super happy with Chrome planning to include obsoleted/proprietary APIs. That could easily make web compatibility issues worse.

cc @tomvangoethem

annevk avatar Nov 13 '18 08:11 annevk

@jarryd999 Can you further elaborate on how the Google properties intend to use this extra data?

Like, is the idea that Cache storage will always be treated as evictable and IndexedDB always as canonical, so Cache can be ignored for storage pressure reasons and only IndexedDB matters? Or is this desired for telemetry so that engineers can say "holy cow, we've got a ton of WebSQL storage still used, we'd better push additional telemetry gathering for WebSQL/push specific cleanup logic/send clear-site-data to anyone reporting that usage and start from scratch", etc. ?

I ask because one of the issues that came up a few times at the 2018 TPAC was the idea that eviction is still problematic because there's still only a single storage "bucket" and things like the relationship between a ServiceWorker registration and the storages it used exists only in the mind of the developers who wrote the code so we continue to be faced with all-or-nothing eviction. It seems like both browsers and the various teams on a site would benefit if we implemented multiple buckets and our estimate included usage by bucket.

Obviously, more buckets is not a simple change since it would have to be plumbed through all existing APIs and there's still the issue of pre-existing data. So the other question in my mind is how are these top-level summaries more useful than returning size-estimates at a more detailed granularity?

For example, in Firefox we have a memory reporting infrastructure "about:memory" (and the underlying API) that provides a hierarchical summation of our memory usage. I could see it being very useful to have entries for "indexedDB/foo" and perhaps "indexedDB/foo/blobs", "indexedDB/foo/key-values", and "indexedDB/foo/indexes". Perhaps even "indexedDB/foo/key-values/large-values" and "indexedDB/foo/key-values/small-values", etc.

The idea there would be that we could spec some explicit names for the top-level of the hierarchy and immediate children, but that browsers could go into more implementation-specific detail if they wanted and it was requested. I imagine we'd limit that level of detail to our own devtools initially, stopping at "indexedDB/DBNAME" granularity for content.

asutherland avatar Nov 13 '18 13:11 asutherland

It seems ideal to cordon off nonstandard or implementation-specific keys into a separate method or substructure, with a name like implementationDefined, to at least slightly reduce the chance of developers taking a dependency on those values and writing non-portable code.

domenic avatar Nov 13 '18 14:11 domenic

We've received feedback that developers would like a breakdown as well. However, when asked about why they would want to know the breakdown specifically, it seems that the use case is to determine if something would fit into the client-side storage before making an attempt to fetch/create the data. @inexorabletash / @jarryd999 are there other use cases that you can share?

aliams avatar Nov 14 '18 02:11 aliams

One use case we have, is to be able to assess how much space we're taking for caching html/js/css resources vs domain model data content. Both can be variable, so it's hard to guess how much space is left for data content for a particular user.

Also, it would help us find bugs in the browser, for example if one storage subsystem is not cleaning up resources correctly on disk eg: crbug.com/782869.

ralphch0 avatar Nov 14 '18 14:11 ralphch0

@annevk - This ambiguity around the size of opaque responses should remain intact given that we include the padding in each entry in Cache Storage, and thus the usageDetails dictionary does not leak any additional information.

@annevk @domenic - Regarding obsoleted/proprietary storage backends, we would like to propose an alternative: The usageDetails dictionary would only contain members whose value is greater than 0. This means that a site that doesn’t use proprietary APIs does not see proprietary members in usageDetails. Your proposal seems reasonable, but we have concerns about what it would look like when deprecating a feature in the future. Moving it’s usage to a deprecated dictionary is in itself a web-breaking change.

@asutherland @aliams - The primary goal of this change is to aid in debugging. For example, one can consider an email client that uses IndexedDB to store text and Cache Storage to store attachments. With the proposed change, said app would be able to debug problematic storage scenarios: high Cache Storage usage but low IndexedDB usage would suggest the app forgot to delete attachments when evicting messages from the local cache. If there was high usage for both storage systems, this would mean the app is caching too many messages, suggesting the eviction policy is not behaving correctly.

While we’d like to address the all-or-nothing eviction issue, any changes around the single temporary bucket are out of scope for this particular proposal.

@ralphch0 Thank you for providing your use case! We’ll be sure to keep this in mind as we move forward.

jarryd999 avatar Nov 14 '18 21:11 jarryd999

My concern about this change is that while it's trivial to implement, it seems biased towards consumers who have exclusive control over the data stored on their origin and use a data storage regime that very cleanly uses different storage APIs for mutually exclusive purposes, like gmail. If you store everything in IndexedDB or have other JS code-bases on your origin, it doesn't seem useful. Which suggests we will eventually want something more sophisticated, but then we still have the specification and developer documentation burden of this enhancement.

asutherland avatar Nov 14 '18 23:11 asutherland

@asutherland - We do expect that, in the future, consumers will have more buckets than the default bucket. We see the proposed change as a complement to said future rather than a redundancy. Each bucket will still have access to all the storage backends, so we will still want an API that, given a bucket, will return the per system usage details.

Today’s apps are storing everything in the default bucket, so they can still make use of a per system breakdown of usage. While you are correct in saying that this change is not very useful for apps that only use IndexedDB, we believe that there are enough people who find this change useful that we would like to develop a solution for this use case.

jarryd999 avatar Nov 16 '18 00:11 jarryd999

@jarryd999 If the plan going forward is more buckets, I agree that this approach makes sense and works. I also like the idea of formalizing the values via explicit dictionary keys with the zero values not being present and therefore not exposed to JS.

As an aside, Firefox is moving how we store localStorage and so we would plan to expose it via this mechanism. Also, our underlying storage for ServiceWorkers is via Cache API in a separate namespace, but it makes sense to break that out separately from content's use of Cache API so we would do that too.

asutherland avatar Nov 16 '18 18:11 asutherland

@asutherland - re:

Firefox is moving how we store localStorage

What's your timeframe for that, and do you have docs? (Interested in questions like: will there still be a localStorage-specific limit, etc)

inexorabletash avatar Nov 16 '18 18:11 inexorabletash

Based on the comments above, I would like to recap what we're thinking right now.

dictionary StorageEstimate {
  unsigned long long usage;
  unsigned long long quota;
  StorageUsageDetails usageDetails;
};

dictionary StorageUsageDetails {
  unsigned long long indexedDB;
  unsigned long long caches;
  unsigned long long serviceWorkerRegistrations;
};

The StorageUsageDetails dictionary will only contain members that have a usage greater than 0.

Note: Chrome also intends to ship with an applicationCache member of the StorageUsageDetails dictionary.

Thank you to @aliams @annevk @asutherland @domenic @inexorabletash @ralphch0 for your feedback.

jarryd999 avatar Nov 27 '18 01:11 jarryd999