Create and maintain a 10k-row subset table
Suggested in the HTTP Archive Slack channel:
Was wondering if it makes sense to add a "sample" dataset that contains data for the first ~1000 pages. This way you can easily test out a query on httparchive.latest.response_bodies_desktop using something smaller like httparchive.sample.response_bodies_desktop. I manually create sample datasets for the same reason when working with the larger tables.
Having an official 10K subset would make this process cheaper for non-Google folks, and would make it feasible to run an occasional query without hitting the free plan limits.
Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.
How do we define ~first? Also, what's the benefit of this vs "LIMIT 10000"?
LIMIT 10000 can still cause a full scan of all rows with some queries, and thus cost substantially more. Having a smaller table reduces that cost, allowing for developing and iterating on the query.
It’s not clear to me whether you can use a view with LIMIT 10000 to achieve this and have the billing only reflect the 10K rows.
I tested out a VIEW with a LIMIT 1000 clause in it. It still caused a full scan of all rows in the table that the VIEW was based on.
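A minimal sketch of that kind of view, assuming placeholder my_project.sampleset names; querying it is still billed as a scan of the underlying table:

-- A LIMIT inside a view does not reduce bytes billed: BigQuery still scans
-- every referenced column of the base table before applying the LIMIT.
CREATE VIEW `my_project.sampleset.pages_mobile_view` AS
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
LIMIT 1000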
As for how to define "first", I've been using the top N based on the Alexa rank, i.e. WHERE rank > 0 ORDER BY rank LIMIT 1000.
For example, my pages_mobile sample query reduced the 342MB table to 765KB:
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE rank > 0
ORDER BY rank ASC
LIMIT 1000
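A minimal sketch of materializing that result as its own table, assuming DDL is available and using the httparchive_sampleset dataset referenced below (running the SELECT with a destination table configured in the BigQuery UI would achieve the same thing):

-- Save the 1,000 top-ranked pages as a standalone table so that later queries
-- are billed against ~765KB instead of the full 342MB table.
CREATE TABLE `httparchive_sampleset.pages_mobile` AS
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE rank > 0
ORDER BY rank ASC
LIMIT 1000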
Then create other sample tables based on the pageid column in this dataset, or on the url for the HAR datasets, or on a NET.HOST(url) match for the CrUX datasets.
This requests sample table uses the following query, which processes 43.3GB and results in a 132MB table that I can use to test JOINs with the pages table later on:
SELECT *
FROM `httparchive.runs.2018_02_15_requests_mobile`
WHERE pageid IN (
SELECT pageid
FROM `httparchive_sampleset.pages_mobile`
)
And this query creates a Lighthouse sample table, reducing the 115GB table down to 280MB:
SELECT *
FROM `httparchive.lighthouse.2018_02_15_mobile`
WHERE url IN (
SELECT url
FROM `httparchive_sampleset.pages_mobile`
)
AND report IS NOT NULL
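For the CrUX case mentioned earlier, the host match might look something like this (a sketch; the column names follow the public chrome-ux-report schema, and the 201802 release is assumed):

-- Subset a CrUX table to the origins of the sampled pages by matching on host.
SELECT *
FROM `chrome-ux-report.all.201802`
WHERE NET.HOST(origin) IN (
  SELECT NET.HOST(url)
  FROM `httparchive_sampleset.pages_mobile`
)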
Creating the sample tables for the HAR and Lighthouse datasets will do more to reduce the processing overhead of developing queries than sampling the pages/requests tables. But we should still include the smaller tables so that we can test JOINs across all of them.
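For example, the kind of JOIN the sample tables should support might look like this (a sketch; the httparchive_sampleset.lighthouse_mobile name is an assumption following the naming above):

-- Join the sampled pages to the sampled Lighthouse reports on the page URL.
SELECT
  p.url,
  p.rank,
  JSON_EXTRACT_SCALAR(l.report, '$.lighthouseVersion') AS lighthouse_version
FROM `httparchive_sampleset.pages_mobile` p
JOIN `httparchive_sampleset.lighthouse_mobile` l
ON l.url = p.url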
WDYT of just having a random sample? It would be future-proof if we ever move away from Alexa ranks, and possibly more representative of the full dataset than just the head URLs.
I think this is a great idea. I have had a lot of hesitation digging into the Lighthouse data, knowing each query would eat 13% of my free 1TB. Letting me refine my queries on a 280MB (or even a couple GB) table would lower the barrier to entry.
I created a few sample tables here. These use the queries I mentioned above for the Alexa top 1000.
A random sample makes sense too, but we would need to make sure that the URLs are consistent across sample tables so that JOINs can be tested properly.
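One way to do that (a sketch, assuming standard SQL) is to sample deterministically by hashing the URL rather than using RAND(), so that every table picks the same pages:

-- Roughly a 1-in-1000 sample: the same URLs land in bucket 0 in every table,
-- so JOINs across the sampled tables still line up.
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(url)), 1000) = 0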
I didn't create them for every table yet, but this is a start. I'm happy to share this data set more broadly, but I think it's better off living under the HTTP Archive. Let me know if there is anything I can do to help with this!
@paulcalvano In the short term, any chance you can recreate these with the tables in the new structure?
Longer term, if we want people to query the raw data, then I think maintaining these in the public dataset is a must.
As it stands it's just too expensive to query the main tables, and it's only going to get more expensive as the HTTP Archive grows.
For example, this seemingly innocuous query processes 708GB!
SELECT page, extractLinkHeaders(JSON_EXTRACT(payload, '$.response.headers'))
FROM `httparchive.requests.2018_10_15_mobile`
WHERE url = "https://starpowertalent.com/";
(Using SQL as the query language makes BigQuery accessible, but it also 'tricked' me into thinking about processing in a relational-DB way, which of course isn't how it works. For a while I was confused as to why the WHERE clause didn't limit the bytes processed.)
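The cost is driven by the columns the query references, not the rows the WHERE clause keeps. A rough illustration of the difference (the exact byte counts will vary):

-- Scans only the url column, so it processes a small fraction of the table.
SELECT url
FROM `httparchive.requests.2018_10_15_mobile`
WHERE url = "https://starpowertalent.com/"

-- Referencing payload forces a scan of the entire payload column,
-- no matter how selective the WHERE clause is.
SELECT url, payload
FROM `httparchive.requests.2018_10_15_mobile`
WHERE url = "https://starpowertalent.com/"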
Anything that wants to query the response bodies with a regexp would benefit a lot from this, since regexps are frequently a little tricky and you'd like to be able to test and make mistakes against a much smaller dataset than even the oldest/smallest tables I can find.
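For example, a regexp could be developed against a hypothetical sampled response bodies table first (the httparchive.sample.response_bodies_mobile name here is an assumption, following the naming suggested at the top of the thread), then pointed at the full table once it works:

-- Iterate on the pattern against the (hypothetical) sample table;
-- response_bodies tables have page, url and body columns.
SELECT page, url
FROM `httparchive.sample.response_bodies_mobile`
WHERE REGEXP_CONTAINS(body, r'(?i)<meta[^>]*\bgenerator\b')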
1k and 10k sample tables have been created for the Web Almanac:
https://github.com/HTTPArchive/almanac.httparchive.org/issues/34?#issuecomment-502738567