Reduce size of Lighthouse payload

Open rviscomi opened this issue 7 years ago • 2 comments

The latest lighthouse.2018_10_15 table is 237 GB. A query over all lighthouse tables currently scans 4.15 TB and takes several minutes to run.

  1. identify parts of the JSON payload that contribute significantly to its size but are unnecessary or unlikely to have analytical value
  2. modify the Dataflow pipeline to omit these parts of the payload
  3. profit
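
As a starting point for step 1, here is a minimal sketch of ranking the top-level keys of a single report by serialized size. The function name `size_by_top_level_key` and the sample report are made up for illustration; a real analysis would run over a sample of actual rows from a lighthouse table.

```python
import json


def size_by_top_level_key(report):
    """Return (key, serialized_bytes) pairs for a report dict, largest first."""
    sizes = {
        key: len(json.dumps(value, separators=(",", ":")).encode("utf-8"))
        for key, value in report.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


# Tiny made-up report, for illustration only (not real Lighthouse output).
sample = {
    "requestedUrl": "https://example.com/",
    "audits": {"final-screenshot": {"details": {"data": "x" * 1000}}},
    "i18n": {"rendererFormattedStrings": {"a": "b"}},
}

ranked = size_by_top_level_key(sample)
for key, size in ranked:
    print(key, size)
```

The same helper can be applied one level down (for example to `report["audits"]`) to see which individual audits dominate.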

rviscomi avatar Nov 08 '18 19:11 rviscomi

Can you point to where this trimming could be done?

connorjclark avatar Feb 03 '21 20:02 connorjclark

Hey @connorjclark, the get_lighthouse_reports function in the Dataflow pipeline is where we could trim excess response data:

https://github.com/HTTPArchive/bigquery/blob/acef15add27f0ba360fba44e2b74ab2575baed46/dataflow/python/bigquery_import.py#L188-L222
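
For illustration, a rough sketch of what such a trimming step might look like. This is not the code in `get_lighthouse_reports`; the helper name `trim_lighthouse_report` is hypothetical, and the keys listed for removal are only assumptions about what might be large and low-value, not a decision made in this thread.

```python
import json

# Hypothetical choices: which fields to drop would need to be confirmed
# by measuring real payloads and checking what analyses depend on them.
DROP_TOP_LEVEL = ("i18n", "timing", "stackPacks")
DROP_AUDIT_DETAILS = ("screenshot-thumbnails", "final-screenshot")


def trim_lighthouse_report(report_json):
    """Return the report as compact JSON with the selected fields removed."""
    report = json.loads(report_json)

    # Remove whole top-level sections.
    for key in DROP_TOP_LEVEL:
        report.pop(key, None)

    # Keep the audit entries (and their scores) but drop the bulky details.
    audits = report.get("audits")
    if isinstance(audits, dict):
        for audit_id in DROP_AUDIT_DETAILS:
            audit = audits.get(audit_id)
            if isinstance(audit, dict):
                audit.pop("details", None)

    return json.dumps(report, separators=(",", ":"))


# Made-up input, for illustration only.
raw = json.dumps({
    "requestedUrl": "https://example.com/",
    "categories": {"performance": {"score": 0.9}},
    "audits": {"final-screenshot": {"score": None, "details": {"data": "x" * 1000}}},
    "i18n": {"rendererFormattedStrings": {"a": "b"}},
})
trimmed = json.loads(trim_lighthouse_report(raw))
```

Dropping only `details` rather than the whole audit keeps each audit's score available for queries while removing the embedded image data.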

rviscomi avatar Apr 15 '21 18:04 rviscomi