Create and maintain a 10k-row subset table

Open rviscomi opened this issue 6 years ago • 10 comments

Suggested in the HTTP Archive Slack channel:

Was wondering if it makes sense to add a "sample" dataset that contains data for the first ~1000 pages. This way you can easily test out a query on httparchive.latest.response_bodies_desktop using something smaller like httparchive.sample.response_bodies_desktop. I manually create sample datasets for the same reason when working with the larger tables.

Having an official 10K subset would make this process cheaper for non-Google folks, and would make it feasible to run an occasional query without hitting the free plan limits.

Just need to figure out which tables to subset, how to organize them, and how to keep them updated with the latest release.
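
For illustration only (the dataset/table names and selection criteria below are placeholders, not a proposal), the maintenance step could be as simple as re-running something like this after each crawl, e.g. via a scheduled query. The regeneration scans the full table once per release, but everyone else then only pays for queries against the small subset:

CREATE OR REPLACE TABLE `httparchive.sample_data.pages_desktop_10k` AS
SELECT *
FROM `httparchive.latest.pages_desktop`  -- placeholder source table
LIMIT 10000  -- selection criteria TBD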

rviscomi avatar Mar 21 '18 05:03 rviscomi

How do we define ~first? Also, what's the benefit of this vs "LIMIT 10000"?

igrigorik avatar Mar 21 '18 05:03 igrigorik

LIMIT 10000 can still cause a full scan of all rows for some queries, and thus cost substantially more. Having a smaller table reduces that cost, making it feasible to develop and iterate on a query.

It’s not clear to me if you can use a view with LIMIT 10000 to achieve this and have the billing only report for the 10K rows.

sleevi avatar Mar 21 '18 08:03 sleevi

I tested out a VIEW with a LIMIT 1000 clause in it. It still caused a full scan of all rows in the table that the VIEW was based on.
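
To illustrate the shape of that test (the scratch dataset and view names here are just placeholders): a view like this still bills for the full underlying table whenever it is queried, because BigQuery charges for the columns scanned in the base table, not the rows returned.

-- Placeholder names; querying this view still scans the full base table.
CREATE VIEW `scratch.pages_mobile_limited` AS
SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
LIMIT 1000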

paulcalvano avatar Mar 21 '18 11:03 paulcalvano

As for how to define "first": I've been using the top N pages by Alexa rank, i.e. WHERE rank > 0 ... ORDER BY rank LIMIT 1000.

For example, my pages_mobile sample query reduced the 342MB table to 765KB:

SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile` 
WHERE rank > 0
ORDER BY rank ASC
LIMIT 1000 

Then other tables can be created by matching on the pageid column in this dataset, the url for the HAR datasets, or a NET.HOST(url) match for the CrUX datasets.

The requests sample table uses the following query, which processes 43.3GB and results in a 132MB table that I can use to test JOINs with the pages table later on:

SELECT * 
FROM `httparchive.runs.2018_02_15_requests_mobile`  
WHERE pageid IN (
  SELECT pageid 
  FROM `httparchive_sampleset.pages_mobile`
)

And this query creates a Lighthouse sample table, reducing the 115GB table down to 280MB.

SELECT * 
FROM `httparchive.lighthouse.2018_02_15_mobile`   
WHERE url IN (
  SELECT url 
  FROM `httparchive_sampleset.pages_mobile`
)
AND report IS NOT NULL 
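
For the CrUX case mentioned above, a sketch along these lines could subset by matching hosts (the CrUX table name is just an example month):

SELECT *
FROM `chrome-ux-report.all.201802`  -- example CrUX table
WHERE NET.HOST(origin) IN (
  SELECT NET.HOST(url)
  FROM `httparchive_sampleset.pages_mobile`
)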

Creating sample tables for the HAR and Lighthouse datasets will do more to reduce the cost of developing queries than the pages/requests tables will, but we should still include the latter so that we can test JOINs across these tables.

paulcalvano avatar Mar 21 '18 12:03 paulcalvano

WDYT of just having a random sample? This is future-proof if we're ever moving away from Alexa ranks and possibly more representative of the full dataset than just the head URLs.
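
For instance (just a sketch, with the sampling fraction picked arbitrarily), a random sample could be as simple as:

SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE RAND() < 0.02  -- tune the fraction so the result lands near the target row count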

rviscomi avatar Mar 21 '18 20:03 rviscomi

I think this is a great idea. I have had a lot of hesitation digging into the Lighthouse data, knowing each query would eat 13% of my free 1TB. Being able to refine my queries on a 280MB (or even a couple of GB) table would lower the barrier to entry.

dougsillars avatar Apr 03 '18 23:04 dougsillars

I created a few sample tables here. These use the queries I mentioned above for the Alexa top 1000.

A random sample makes sense too, but we would need to make sure that the URLs are consistent across sample tables so that JOINs can be tested properly.
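
One way to get that consistency (again just a sketch, reusing the table names above) is to key the pages sample on a hash of the URL, so the same URLs are selected every time; the requests/Lighthouse subsets built from it with the JOIN-style queries above then stay consistent too:

SELECT *
FROM `httparchive.runs.2018_02_15_pages_mobile`
WHERE MOD(ABS(FARM_FINGERPRINT(url)), 50) = 0  -- deterministic; adjust the modulus to hit the target size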

I didn't create them for every table yet, but this is a start. I'm happy to share this data set more broadly, but I think it's better off living under the HTTP Archive. Let me know if there is anything I can do to help with this!

paulcalvano avatar Apr 05 '18 14:04 paulcalvano

@paulcalvano In the short term, any chance you can recreate these with the new-structure tables?

Longer term, if we want people to query the raw data, then I think maintaining these in the public set is a must.

As it stands it's just too expensive to query the main tables, and it's only going to get more expensive as the HTTP Archive grows.

For example, this seemingly innocuous query processes 708GB!

SELECT page, extractLinkHeaders(JSON_EXTRACT(payload, '$.response.headers'))
FROM `httparchive.requests.2018_10_15_mobile`
WHERE url = "https://starpowertalent.com/";

(Using SQL as the query language makes BigQuery accessible, but it also 'tricked' me into thinking about processing in a relational-DB way, which of course isn't how it works. For a while I was confused as to why the WHERE clause didn't limit the bytes processed.)
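
To make the cost model concrete: BigQuery bills for the full size of every referenced column (page, url, payload here) across the whole table, regardless of the WHERE clause. Pointed at a maintained 10k-row subset (the table name below is hypothetical, and extractLinkHeaders is omitted since it's a separate UDF), the same kind of query would cost a few hundred MB instead:

SELECT page, JSON_EXTRACT(payload, '$.response.headers') AS response_headers
FROM `httparchive.sample_data.requests_mobile_10k`  -- hypothetical subset table
WHERE url = "https://starpowertalent.com/"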

andydavies avatar Nov 14 '18 11:11 andydavies

It seems that anything that wants to query the response bodies with a regexp would benefit a lot from this. Regexps are frequently a little tricky, and you'd like to be able to test and make mistakes against a much smaller dataset than even the oldest/smallest ones I can find.
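
For example (sketch only; the subset table name is hypothetical), iterating on a pattern like this against a small response_bodies subset would cost next to nothing, versus terabytes against the full tables:

SELECT page, url
FROM `httparchive.sample_data.response_bodies_mobile_10k`  -- hypothetical subset table
WHERE REGEXP_CONTAINS(body, r'jquery-[0-9.]+(\.min)?\.js')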

bkardell avatar Mar 31 '19 19:03 bkardell

1k and 10k tables have been created for the Web Almanac:

https://github.com/HTTPArchive/almanac.httparchive.org/issues/34?#issuecomment-502738567

paulcalvano avatar Jun 17 '19 17:06 paulcalvano