Desktop Summary Requests incomplete
February 2019 Summary Requests table is 272M rows and 240 GB.
March 2019 Summary Requests table is only 5M rows and 5 GB.
It appears a large amount of data is missing from the March table. The raw data files at https://legacy.httparchive.org/downloads.php also differ in size.
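For reference, the gap is also visible straight from BigQuery's table metadata. A quick check (a sketch in standard SQL, using the __TABLES__ metadata view):

-- Compare row counts and on-disk sizes of the two monthly summary tables
SELECT table_id, row_count, ROUND(size_bytes / POW(2, 30), 1) AS size_gb
FROM `httparchive.summary_requests.__TABLES__`
WHERE table_id IN ('2019_02_01_desktop', '2019_03_01_desktop');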
Confirmed that the downloads are serving a file way too small
Next step is to try to rerun the mysqldump
https://github.com/HTTPArchive/legacy.httparchive.org/blob/9ef583089600d05093c4992a0c92e77f00c26ae8/bulktest/update.php#L214
The local MySQL tables seem to have been cleared out, with the exception of the requests table:
mysql> select count(0) from requests;
+----------+
| count(0) |
+----------+
|  5119678 |
+----------+
1 row in set (0.00 sec)
mysql> select count(0) from requestsdev;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
mysql> select count(0) from requestsmobile;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
mysql> select count(0) from requestsmobiledev;
+----------+
| count(0) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)
The requests that are in that table are only from tests on March 1:
mysql> select min(startedDateTime), max(startedDateTime) from requests;
+----------------------+----------------------+
| min(startedDateTime) | max(startedDateTime) |
+----------------------+----------------------+
|           1551418211 |           1551432347 |
+----------------------+----------------------+
1 row in set (0.00 sec)
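As a sanity check, those epoch values can be decoded directly in MySQL (a quick check; the UTC conversion is an assumption about the server's session time zone):

-- Decode the min/max timestamps from the query above
SELECT FROM_UNIXTIME(1551418211), FROM_UNIXTIME(1551432347);
-- 2019-03-01 05:30:11 and 2019-03-01 09:25:47, i.e. only the first
-- day of the crawl survived.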
So this is why the mysqldump of the requests table is only yielding 647 MB of data: it contains just the ~5M requests from March 1 rather than the full crawl.
Not sure what happened to the requests table to cut it short, or why only desktop was affected. Also not sure if we have any other backups available. The good news is that we do have the HAR files for all of these requests, so it's not a total loss of data, but we would still need to convert the HAR data to the schema of the CSV-based summary tables. This is doable but would require some time. It's also something that's been on our todo list as part of the MySQL deprecation. See https://github.com/HTTPArchive/httparchive.org/issues/23
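The conversion itself should mostly boil down to extracting the summary fields out of the HAR entry JSON in BigQuery. A rough sketch of the idea (the source table and output column names here are illustrative assumptions; the real mapping is in the gist linked later in this thread):

-- Pull a few summary_requests-style fields out of the HAR payloads
SELECT
  url,
  JSON_EXTRACT_SCALAR(payload, '$.request.method') AS method,
  CAST(JSON_EXTRACT_SCALAR(payload, '$.response.status') AS INT64) AS status,
  CAST(JSON_EXTRACT_SCALAR(payload, '$.response.bodySize') AS INT64) AS respBodySize
FROM `httparchive.requests.2019_03_01_desktop`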
I'm still mildly concerned that this is a problem that might happen again, so it's best to keep an eye on the April crawl, especially around the 15th of the month when @pmeenan noticed a suspicious drop in disk space.
FWIW, the requests tables get dropped after the mysqldump completes, so it's not unusual for them to be empty after the crawl, but it looks like something triggered it mid-crawl for the desktop data :(
Yeah it seems something nuked the table before we could do our backups. That said, I'm curious how we ended up with a partial requests table if it's supposed to be dropped after each mysqldump.
Here's how it should work. There's a cron job to run batch_process.php every 30 minutes. batch_process will kick off the mysqldump when the crawl is complete:
https://github.com/HTTPArchive/legacy.httparchive.org/blob/7a5710dc83dd4ca7bb204573fd3fa58c5ea2c1f0/bulktest/batch_process.php#L41-L82
https://github.com/HTTPArchive/legacy.httparchive.org/blob/7a5710dc83dd4ca7bb204573fd3fa58c5ea2c1f0/bulktest/copy.php#L96-L100
https://github.com/HTTPArchive/legacy.httparchive.org/blob/6d1a872a3270360a14eb018871544a0c9c8adf28/crawls.inc#L285-L332
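In short, the invariant should be that the dump completes before the table is cleared; something like the following ordering (an illustrative sketch only, not the literal statements in crawls.inc):

-- 1. mysqldump exports the completed crawl's requests table to a dump file.
-- 2. Only once the export succeeds is the table cleared for the next crawl:
DROP TABLE IF EXISTS requests;
-- 3. The table is recreated before the next crawl starts.
-- The symptoms above suggest step 2 somehow ran mid-crawl.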
Reassigning to Paul; he's got a conversion sheet going to recreate the summary requests data.
Paul and I made lots of progress on this. Here's a table with the summary_requests schema generated from the HARs: https://bigquery.cloud.google.com/table/httparchive:scratchspace.requests_2019_04_01_desktop?tab=preview
Would appreciate another set of eyes to make sure the results look good.
Here's the query that powers it:
https://gist.github.com/rviscomi/52494fdcfa561c88cfb1c4255ce3939d
Noticed today that the summary_pages tables are off as well. Metrics like the total font size are calculated based on the underlying requests, so in the absence of those the summary page data becomes 0.
We'll need to write a query that aggregates requests for each page and computes the summary stats.
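Something along these lines should work as a starting point (a sketch; the exact summary column names like pageid, type, and respBodySize are assumptions about the schema):

-- Recompute per-page summary stats from the rebuilt requests table
SELECT
  pageid,
  COUNT(0) AS reqTotal,
  SUM(respBodySize) AS bytesTotal,
  SUM(IF(type = 'font', respBodySize, 0)) AS bytesFont
FROM `httparchive.scratchspace.requests_2019_04_01_desktop`
GROUP BY pageid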
The November_1_2019 desktop summary requests table is also missing ~90% of the requests.
Number of total requests in...
- Sep 2019: 410,426,130
- Oct 2019: 407,445,152
- Nov 2019: 44,531,993

(as found by SELECT COUNT(0) FROM httparchive:summary_requests.2019_11_01_desktop and the corresponding tables for the other months)
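For what it's worth, all three counts can be pulled in one query with a wildcard table (a sketch in standard SQL):

-- Per-month request counts across the desktop summary tables
SELECT _TABLE_SUFFIX AS crawl, COUNT(0) AS requests
FROM `httparchive.summary_requests.2019_*`
WHERE _TABLE_SUFFIX IN ('09_01_desktop', '10_01_desktop', '11_01_desktop')
GROUP BY crawl
ORDER BY crawl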
The gzipped archives in the buckets also look incomplete: https://console.cloud.google.com/storage/browser/httparchive/Nov_1_2019/?pli=1
Thanks for looking into this.
I think the November 2019 desktop requests CSV got corrupted somehow. I'm unable to regenerate the table without it failing to import into BQ. Leaving this issue open as a reminder to either resolve the CSV issue or generate the table via HAR data.