bigquery
Reduce size of Lighthouse payload
The latest lighthouse.2018_10_15 table is 237 GB. A query over all lighthouse tables currently scans 4.15 TB and takes several minutes to run.

- identify parts of the JSON payload that have little analytical value but contribute significantly to payload size
- modify the Dataflow pipeline to omit these parts of the payload
- profit
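As a first step toward the identification above, something like the following could measure which top-level keys of a report dominate its serialized size (the sample report and its key names here are toy stand-ins, not taken from a real Lighthouse payload):

```python
import json

def key_sizes(report):
    """Return each top-level key's serialized size in bytes, largest first."""
    sizes = {k: len(json.dumps(v)) for k, v in report.items()}
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-in for a Lighthouse report; a real analysis would load
# actual report JSON from the HTTP Archive crawl.
report = {
    "audits": {"screenshot-thumbnails": {"details": "x" * 5000}},
    "timing": {"total": 1234},
    "fetchTime": "2018-10-15T00:00:00.000Z",
}
for key, size in key_sizes(report):
    print(key, size)
```

Running this over a sample of real reports would give a ranked list of candidates to drop.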
Can you point to where this trimming could be done?
Hey @connorjclark, the `get_lighthouse_reports` function in the Dataflow pipeline would be the place to trim off excess response data:
https://github.com/HTTPArchive/bigquery/blob/acef15add27f0ba360fba44e2b74ab2575baed46/dataflow/python/bigquery_import.py#L188-L222
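A trimming step in that function might look roughly like this sketch. The key lists here are assumptions for illustration; the real lists would come from profiling actual reports, and this is not the pipeline's existing code:

```python
import json

# Keys assumed (for illustration only) to carry large, low-value data;
# the actual omit lists would be chosen after profiling real reports.
OMIT_TOP_LEVEL = {"i18n", "stackPacks"}
OMIT_AUDIT_DETAILS = {"screenshot-thumbnails", "full-page-screenshot"}

def trim_report(report_json):
    """Drop heavy, low-analytical-value parts of a Lighthouse report
    before it is written to BigQuery."""
    report = json.loads(report_json)
    for key in OMIT_TOP_LEVEL:
        report.pop(key, None)
    # Keep each audit's score/metadata but drop its bulky details blob.
    for audit_id in OMIT_AUDIT_DETAILS:
        audit = report.get("audits", {}).get(audit_id)
        if audit:
            audit.pop("details", None)
    # Compact separators shave a bit more off the serialized size.
    return json.dumps(report, separators=(",", ":"))
```

The per-audit handling matters because dropping whole audits would break queries that only need their scores.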