httparchive.org

Tech report: category-level aggregations

Open • rviscomi opened this issue 1 year ago • 1 comment

Use case: when comparing technologies within the same category, it can be useful to know how they all compare to some kind of category-level aggregation over all pages within the category.

Mockup: image

The blue line represents an aggregation of all pages within the CMS category, so a user can see how it compares to specific technologies within that category. It could also be possible to compare entire categories.

The technical implementation could look something like this:

  • update the technologies table schema to include a field indicating whether the row pertains to a technology or a category aggregation
    • all dimensions supported: rank, client, geo
    • backfill all historical data
  • provide a param in the API endpoints to distinguish between the two, only returning data for the selected aggregation type (default: technology)
  • add categories to the UI, similar to the special "ALL" technology
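
As a rough illustration of the API parameter described in the list above, here is a minimal Python sketch. The parameter name `aggregation_type`, the `query_rows` helper, and the sample rows are all hypothetical; the real endpoints and storage layer may look quite different.

```python
# Hypothetical in-memory stand-in for the technologies table.
ROWS = [
    {"type": "technology", "category": "CMS", "app": "WordPress", "client": "desktop"},
    {"type": "technology", "category": "CMS", "app": "Drupal", "client": "desktop"},
    {"type": "category", "category": "CMS", "app": "All CMSs", "client": "desktop"},
]

VALID_TYPES = ("technology", "category")


def query_rows(rows, category, aggregation_type="technology"):
    """Return rows for one category, restricted to a single aggregation type.

    `aggregation_type` is an assumed parameter name; it defaults to
    "technology" so existing callers keep getting technology-level rows.
    """
    if aggregation_type not in VALID_TYPES:
        raise ValueError(f"unknown aggregation type: {aggregation_type!r}")
    return [
        row
        for row in rows
        if row["category"] == category and row["type"] == aggregation_type
    ]


technology_apps = [row["app"] for row in query_rows(ROWS, "CMS")]
category_apps = [row["app"] for row in query_rows(ROWS, "CMS", "category")]
```

Defaulting to the technology type means the two kinds of rows never mix in a single response unless the caller explicitly asks for the category aggregation.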

In terms of the schema changes, we currently have the following fields:

  • date (2024-08-01)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (WordPress)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use WordPress for the given dimensions

The updated schema would look something like this for the CMS-level aggregation:

  • date (2024-08-01)
  • type (category)
  • geo (ALL)
  • rank (ALL)
  • category (CMS)
  • app (All CMSs)
  • client (desktop)
  • [stats] where each field is aggregated over the set of pages that use one or more CMSs for the given dimensions
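
For illustration only, the two row shapes could be modeled like this in Python. The class name `TechReportRow` and the choice of `"technology"` as the value for existing rows are assumptions; the stats fields are omitted because the issue does not enumerate them.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class TechReportRow:
    """Hypothetical model of one row's dimensions (stats fields omitted)."""

    date: datetime.date
    type: str  # assumed values: "technology" or "category"
    geo: str
    rank: str
    category: str
    app: str
    client: str


# Existing technology-level row, with the new type field added.
wordpress_row = TechReportRow(
    date=datetime.date(2024, 8, 1),
    type="technology",
    geo="ALL",
    rank="ALL",
    category="CMS",
    app="WordPress",
    client="desktop",
)

# Proposed category-level row for the same dimensions.
cms_row = TechReportRow(
    date=datetime.date(2024, 8, 1),
    type="category",
    geo="ALL",
    rank="ALL",
    category="CMS",
    app="All CMSs",
    client="desktop",
)
```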

Calculating category-level data from technology-level aggregations won't work because percentiles cannot be accurately combined. At best we could take a weighted average of the medians, but that still wouldn't deduplicate origins that appear multiple times in a category because they use multiple technologies. For example, jQuery UI is always used with jQuery within the JS libraries category, so those websites would be counted twice. The implementation would therefore need to process the raw origin-level data.
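
To make the problem concrete, here is a small Python sketch using made-up origin-level data (the origins, technologies, and metric values are all invented for illustration). It compares a weighted average of per-technology medians against the median computed directly over deduplicated origins.

```python
from statistics import median

# Hypothetical origin-level data: origin -> (technologies used, metric value).
ORIGINS = {
    "a.example": ({"jQuery", "jQuery UI"}, 100),
    "b.example": ({"jQuery", "jQuery UI"}, 120),
    "c.example": ({"jQuery"}, 300),
    "d.example": ({"React"}, 900),
    "e.example": ({"React"}, 1000),
}

# Group metric values by technology, as the technology-level rows do today.
by_tech = {}
for techs, value in ORIGINS.values():
    for tech in techs:
        by_tech.setdefault(tech, []).append(value)

# Summing per-technology counts double counts origins that use both
# jQuery and jQuery UI.
summed_count = sum(len(values) for values in by_tech.values())
unique_count = len(ORIGINS)

# Weighted average of the per-technology medians (the "at best" approach).
weighted_median = (
    sum(median(values) * len(values) for values in by_tech.values())
    / summed_count
)

# Median over the raw, deduplicated origins.
true_median = median(value for _, value in ORIGINS.values())

print(summed_count, unique_count)    # 7 5
print(weighted_median, true_median)  # roughly 354.29 vs 300
```

In this toy data set the per-technology counts sum to 7 even though only 5 origins exist, and the weighted average of medians lands far from the true category median.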

rviscomi, Sep 04 '24 18:09

We decided to put this feature in the backlog for now, given the complexity of the implementation and relatively low value it'd bring to the UX. If anyone feels strongly about it, feel free to add your 👍 to the comment above.

rviscomi, Sep 04 '24 19:09