
Create a random sample set for the latest of each BigQuery dataset

Open rviscomi opened this issue 7 years ago • 0 comments

The BQ tables are getting unwieldy and expensive to query. We use scheduled queries to generate the tables in the `latest` dataset. Similarly, we should generate a randomly sampled subset of each of these tables, limited to some number of rows, so that queries against them stay inexpensive. For the `requests` dataset we should sample by page so that every page in the sample retains all of its respective requests.
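One way to sample by page is a deterministic hash filter on the page URL, then a join so requests follow their pages. This is only a sketch: the project, dataset, table, and column names (`httparchive.latest.*`, `httparchive.sample.*`, `url`, `page`) and the 1-in-100 rate are assumptions, not the actual schema.

```sql
-- Sketch: materialize a ~1% page-level sample of the requests table.
-- Table/column names below are hypothetical.
CREATE OR REPLACE TABLE `httparchive.sample.requests_desktop` AS
SELECT r.*
FROM `httparchive.latest.requests_desktop` AS r
JOIN (
  -- Hash the page URL so the same pages are selected on every run.
  SELECT url
  FROM `httparchive.latest.pages_desktop`
  WHERE MOD(ABS(FARM_FINGERPRINT(url)), 100) = 0
) AS p
ON r.page = p.url;
```

Hashing with `FARM_FINGERPRINT` rather than `RAND()` keeps the sample stable across scheduled runs and guarantees that a sampled page brings all of its requests with it.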

  • [ ] calculate the average row size in bytes for each dataset
  • [ ] pick a sample size in rows corresponding to about 1 GB per dataset (this can change)
  • [ ] schedule a query for each dataset (requests, pages, etc) and each client (desktop, mobile) to materialize sample tables
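The first two checklist items reduce to simple arithmetic: divide the per-dataset size budget by the average row size to get a row count for the sample. A minimal sketch, with the average row sizes below being made-up placeholders rather than measured values:

```python
def sample_row_count(avg_row_bytes: int, target_bytes: int = 10**9) -> int:
    """Rows needed so the sample table is roughly target_bytes (default ~1 GB)."""
    return target_bytes // avg_row_bytes

# Hypothetical average row sizes per dataset, in bytes.
avg_row_sizes = {
    "pages": 25_000,
    "requests": 2_000,
}

for dataset, avg in avg_row_sizes.items():
    print(f"{dataset}: sample {sample_row_count(avg):,} rows")
```

The 1 GB target is stated in the checklist as changeable, so the budget is a parameter rather than a constant.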

rviscomi · Jan 14 '19 06:01