legacy.httparchive.org

Bad desktop data dumps for Jan.

Open ebidel opened this issue 8 years ago • 10 comments

Both desktop data dumps for this month (2017-01-01 and 2017-01-15) are showing malformed data. Known issue?

http://httparchive.org/interesting.php

ebidel avatar Jan 30 '17 05:01 ebidel

Hmm, no that's something we need to investigate. Thanks for the heads up Eric.

/cc @pmeenan @rviscomi

igrigorik avatar Jan 30 '17 06:01 igrigorik

December 2016 looks anomalous also—sudden dramatic drop in overall weight vs. the previous month (if only it were true!).

ronancremin avatar Feb 03 '17 15:02 ronancremin

We had an issue with the requests database where the primary key ran out of 32-bit numbers - doh. It should be fixed for the 2/1 crawl and we're looking at backfilling the December and January crawl stats from the HARs in BigQuery.
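For readers unfamiliar with this failure mode, here is a minimal sketch of the arithmetic behind a 32-bit key running out. The thread does not say whether the column was signed or unsigned, or how it was fixed, so both limits are shown and the 64-bit comparison is only an illustration. The helper names `remaining_ids` and `fits` are hypothetical.

```python
# Illustrative limits for integer key columns (not taken from the actual schema).
INT32_MAX = 2**31 - 1    # 2,147,483,647: largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # 4,294,967,295: largest unsigned 32-bit value
INT64_MAX = 2**63 - 1    # largest signed 64-bit value


def remaining_ids(last_id: int, max_id: int) -> int:
    """Return how many more sequential ids fit before the key type is exhausted."""
    return max(max_id - last_id, 0)


def fits(next_id: int, max_id: int) -> bool:
    """Return True if next_id can be stored in a key column capped at max_id."""
    return 0 <= next_id <= max_id


if __name__ == "__main__":
    last_id = INT32_MAX  # assumed: the table has just used the last signed 32-bit id
    print(remaining_ids(last_id, INT32_MAX))   # no headroom left in 32 bits
    print(fits(last_id + 1, INT32_MAX))        # the next insert cannot get a valid id
    print(fits(last_id + 1, INT64_MAX))        # a 64-bit key would still have room
```

A request-level table grows by every request of every page on each crawl, which is why it would hit this ceiling long before a page-level table does.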

pmeenan avatar Feb 03 '17 16:02 pmeenan

Thanks Patrick! I was looking for evidence of responsive images in WordPress 4.4 hopefully pulling down the average size as it rolls out.

ronancremin avatar Feb 03 '17 16:02 ronancremin

How come the errors for numDomains weren't being flagged as constraint violations?

Themanwithoutaplan avatar Feb 13 '17 11:02 Themanwithoutaplan

I need to learn more about the BigQuery -> MySQL pipeline, but I hope to get this fixed soon.

rviscomi avatar Mar 28 '17 01:03 rviscomi

See also this comment from #116:

The downloads page lists January 2017 but the links are broken. The thing is that the dumps were available at the time and contained valid data. Can they be recreated?

rviscomi avatar Feb 04 '19 23:02 rviscomi

Which links specifically? The desktop links to the archived dumps on archive.org are all working for me.

http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.csv.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.csv.gz

Is the problem that an automated script is trying to use the pre-archive location and it moved once the archiving completed? For the pipeline, would it be easier if a copy of the dumps was also archived to the cloud storage bucket?
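If a script needs the archived location, the four links above suggest a naming pattern it could build from. The sketch below is inferred only from those four URLs. The function name `dump_urls` and the label format are assumptions, and other crawls (mobile, for example) may be named differently.

```python
# Base path taken from the archive.org links quoted in this comment.
BASE = "http://www.archive.org/download"


def dump_urls(label: str) -> list:
    """Build the archived dump URLs for one crawl label such as 'Jan_1_2019'.

    Assumes the pattern seen in this thread: a per-crawl folder containing
    'pages' and 'requests' dumps, each as .gz and .csv.gz.
    """
    folder = f"httparchive_downloads_{label}"
    return [
        f"{BASE}/{folder}/httparchive_{label}_{table}{ext}"
        for table in ("pages", "requests")
        for ext in (".gz", ".csv.gz")
    ]


if __name__ == "__main__":
    for url in dump_urls("Jan_1_2019"):
        print(url)
```

A pipeline could request these URLs first and fall back to the pre-archive location only if they are not yet available.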

pmeenan avatar Feb 04 '19 23:02 pmeenan

Sorry this is an old issue from 2017 that I updated. Was triaging old issues.

rviscomi avatar Feb 04 '19 23:02 rviscomi

Whoops. My bad. Dumps from 2 years ago? If the links don't work they're gone.

pmeenan avatar Feb 05 '19 00:02 pmeenan