httparchive.org

Adding Firefox use counter data to HTTPArchive

Open zcorpan opened this issue 2 years ago • 8 comments

Hey folks,

In https://github.com/HTTPArchive/legacy.httparchive.org/issues/59 Chrome's use counter data was added to httparchive.

The relevant code in this repo seems to be https://github.com/HTTPArchive/data-pipeline/blob/41fe511951797d25cebc71097726c8b65497212b/modules/import_all.py#L146

I'd like to have Firefox use counter data also be available in httparchive. (Maybe more things also, but starting with use counters.) To that end, I've filed https://bugzilla.mozilla.org/show_bug.cgi?id=1813593 so that the data can be extracted locally.

Are there considerations we should know about for this to work?

cc @emilio @janodvarko

zcorpan avatar Jan 31 '23 13:01 zcorpan

We only run tests in Chrome, so I don't think this would be feasible. @pmeenan WDYT?

rviscomi avatar Jan 31 '23 13:01 rviscomi

Why not? We'd need to run tests in desktop Firefox as well, in addition to the current desktop Chrome and Android Chrome. I don't think it's necessarily useful to collect and store everything for Firefox that is currently stored for Chrome; that would increase storage by 50%. But storing only use counter data seems negligible.

zcorpan avatar Jan 31 '23 14:01 zcorpan

It’s not just storage. It’s also crawl capacity.

tunetheweb avatar Jan 31 '23 14:01 tunetheweb

As it stands right now, it takes ~25,000 VMs the better part of a week to collect the data. Technically it is pretty easy to support, but financially it would increase the running costs by ~30% (assuming we'd only run one config). I'm guessing some form of additional sponsorship would be needed to cover the costs.

pmeenan avatar Jan 31 '23 14:01 pmeenan

OK, thanks. What would that amount to in USD?

zcorpan avatar Jan 31 '23 14:01 zcorpan

30% of our current crawl expenses would come out to about $20k per month.

rviscomi avatar Jan 31 '23 14:01 rviscomi

That is likely more than the value Mozilla would get from the data. 🙂

For web compat analysis, the sample_data URLs (10k pages) would still be better than nothing. Assuming a full run would be 12,500,000 URLs (httparchive.pages.2023_01_01_desktop has 12,647,566 rows), 10k pages would be 0.08% of that $20k monthly cost, which is ... $16.
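The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes cost scales linearly with the number of URLs crawled, and the constant and function names are made up for illustration.

```python
# Figures quoted in this thread; linear cost scaling is an assumption.
FULL_RUN_URLS = 12_500_000      # rounded size of a full desktop crawl
SAMPLE_URLS = 10_000            # size of the sample_data URL set
FULL_FIREFOX_RUN_USD = 20_000   # ~30% of current monthly crawl expenses


def sample_cost(sample_urls, full_urls, full_cost_usd):
    """Return (fraction of a full run, estimated cost in USD)."""
    fraction = sample_urls / full_urls
    return fraction, fraction * full_cost_usd


fraction, cost = sample_cost(SAMPLE_URLS, FULL_RUN_URLS, FULL_FIREFOX_RUN_USD)
print(f"{fraction:.2%} of a full run, roughly ${cost:.0f} per month")
```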

Would it be feasible to start there?

zcorpan avatar Jan 31 '23 15:01 zcorpan

Update: https://bugzilla.mozilla.org/show_bug.cgi?id=1813593 is now fixed (thanks @emilio!). It's possible to set these prefs to log use counter data to stderr:

  • dom.use_counters.dump.document
  • dom.use_counters.dump.worker
  • dom.use_counters.dump.page

For the purpose of this issue, using page and worker but not document makes the most sense. (document is logged for each document, including e.g. SVGs; these accumulate into page, which is per top-level page.) worker use counters don't accumulate into page, so they need to be included separately.
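As a rough sketch of that configuration, the snippet below generates user.js lines that turn on page and worker and leave document off. It assumes the three prefs are booleans and that the test profile is configured through a user.js file; neither is confirmed in this thread, and the helper name is hypothetical.

```python
# Assumption: the prefs are boolean and the profile reads a user.js file.
PREFS = {
    "dom.use_counters.dump.page": True,
    "dom.use_counters.dump.worker": True,
    "dom.use_counters.dump.document": False,
}


def user_js_lines(prefs):
    """Render a dict of boolean prefs as user.js user_pref() lines."""
    return [
        f'user_pref("{name}", {str(value).lower()});'
        for name, value in prefs.items()
    ]


print("\n".join(user_js_lines(PREFS)))
```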

The logged output looks like this:

USE_COUNTER_PAGE: USE_COUNTER2_DOCUMENTOPEN_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_Display_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_FontStyle_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_FontWeight_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/

You need to close the page for some of the use counters to be added to the log.
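A minimal sketch of how a crawler could turn that stderr output into per-page counter sets. The regex is inferred only from the four sample lines above; whether worker lines follow the same "PREFIX: COUNTER - URL" shape is an assumption, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Inferred from the sample lines only; other kinds (e.g. worker) are assumed
# to use the same layout.
LINE_RE = re.compile(
    r"^USE_COUNTER_(?P<kind>[A-Z]+): (?P<counter>\S+) - (?P<url>.+)$"
)


def parse_use_counters(stderr_text):
    """Group counter names by (kind, url); lines that don't match are skipped."""
    counters = defaultdict(set)
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            key = (match["kind"], match["url"])
            counters[key].add(match["counter"])
    return dict(counters)


SAMPLE = """\
USE_COUNTER_PAGE: USE_COUNTER2_DOCUMENTOPEN_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_Display_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_FontStyle_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
USE_COUNTER_PAGE: USE_COUNTER2_CSS_PROPERTY_FontWeight_PAGE - http://software.hixie.ch/utilities/js/live-dom-viewer/
"""

result = parse_use_counters(SAMPLE)
for (kind, url), names in result.items():
    print(kind, url, sorted(names))
```

Because some counters are only logged when the page is closed, the parser would need to run after the page has been closed, not while it is still loading.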

zcorpan avatar Feb 06 '23 19:02 zcorpan

No development in the last 2 years. Closing as outdated.

max-ostapenko avatar Jul 14 '25 18:07 max-ostapenko