legacy.httparchive.org
Bad desktop data dumps for Jan.
Both desktop data dumps for this month (2017-01-01 and 2017-01-15) are showing malformed data. Known issue?
http://httparchive.org/interesting.php
Hmm, no that's something we need to investigate. Thanks for the heads up Eric.
/cc @pmeenan @rviscomi
December 2016 looks anomalous also—sudden dramatic drop in overall weight vs. the previous month (if only it were true!).
We had an issue with the requests database where the primary key ran out of 32-bit numbers - doh. It should be fixed for the 2/1 crawl and we're looking at backfilling the December and January crawl stats from the HARs in bigquery.
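For readers unfamiliar with this failure mode, here is a minimal sketch of the arithmetic behind a 32-bit key running out. The comment above does not say whether the column was signed or unsigned, so both limits are shown. The function name `remaining_ids` and the sample ID values are hypothetical, chosen only for illustration; this is not the HTTP Archive schema or code.

```python
# Largest values a 32-bit integer key can hold.
INT32_SIGNED_MAX = 2**31 - 1      # 2147483647
INT32_UNSIGNED_MAX = 2**32 - 1    # 4294967295


def remaining_ids(current_max_id: int, limit: int = INT32_SIGNED_MAX) -> int:
    """Return how many more auto-increment IDs are left before `limit` is reached.

    `current_max_id` is the highest key already in the table (hypothetical input).
    Once the result hits 0, further inserts cannot be given a new key.
    """
    return max(limit - current_max_id, 0)


if __name__ == "__main__":
    # Hypothetical table that has already used most of the signed 32-bit range.
    used = 2_147_483_000
    print("IDs left (signed 32-bit):  ", remaining_ids(used))
    print("IDs left (unsigned 32-bit):", remaining_ids(used, INT32_UNSIGNED_MAX))
```

A check like this, run against the current maximum key before each crawl, would be one way to notice the ceiling approaching. Widening the column to a 64-bit type is the usual remedy, though the thread does not say which fix was applied here.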
Thanks Patrick! I was looking for evidence of responsive images in WordPress 4.4 hopefully pulling down the average size as it rolls out.
How come the errors for numDomains weren't being flagged as constraint violations?
I need to learn more about the BigQuery -> MySQL pipeline, but I hope to get this fixed soon.
See also this comment from #116:
The downloads page lists January 2017 but the links are broken. The thing is that the dumps were available at the time and contained valid data. Can they be recreated?
Which links specifically? The desktop links to the archived dumps on archive.org are all working for me.
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_pages.csv.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.gz
http://www.archive.org/download/httparchive_downloads_Jan_1_2019/httparchive_Jan_1_2019_requests.csv.gz
Is the problem that an automated script is trying to use the pre-archive location and it moved once the archiving completed? For the pipeline, would it be easier if a copy of the dumps was also archived to the cloud storage bucket?
Sorry this is an old issue from 2017 that I updated. Was triaging old issues.
Whoops. My bad. Dumps from 2 years ago? If the links don't work they're gone.