Update the "latest" tables from Dataflow
Forked from #76
Currently we use scheduled queries to scan each dataset/client combo for the latest release and save the results to the corresponding `latest.<dataset>_<client>` table.
For example, here's the scheduled query that generates the `latest.response_bodies_mobile` table:
```sql
#standardSQL
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  ENDS_WITH(_TABLE_SUFFIX, 'mobile') AND
  SUBSTR(_TABLE_SUFFIX, 0, 10) = (
    SELECT
      SUBSTR(table_id, 0, 10) AS date
    FROM
      `httparchive.response_bodies.__TABLES_SUMMARY__`
    ORDER BY
      table_id DESC
    LIMIT
      1)
```
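The intent of the query is just "find the newest release date, then keep the tables whose names start with that date and end with the client". A minimal Python sketch of that selection logic (table IDs and helper names here are illustrative, not part of the pipeline):

```python
def latest_release_prefix(table_ids):
    """Return the YYYY_MM_DD prefix of the most recent release.

    Mirrors the subquery above: table IDs like '2018_09_01_mobile'
    sort lexicographically by date, so the max is the newest, and
    its first 10 characters are the release date.
    """
    return max(table_ids)[:10]


def latest_tables(table_ids, client):
    """Tables belonging to the latest release for a given client."""
    prefix = latest_release_prefix(table_ids)
    return [t for t in table_ids
            if t.startswith(prefix) and t.endswith(client)]
```

For example, given `['2018_08_01_desktop', '2018_08_01_mobile', '2018_09_01_desktop', '2018_09_01_mobile']`, `latest_tables(..., 'mobile')` keeps only `'2018_09_01_mobile'`.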
BigQuery can usually prune the tables scanned by a wildcard query when the WHERE clause clearly limits the `_TABLE_SUFFIX` pseudocolumn. That doesn't seem to be happening here, presumably because the suffix is compared against a subquery rather than a constant, so BigQuery can't prune at planning time. The estimated cost of this query is over $1000 (200 TB): "This query will process 202.9 TB when run."
Queries for each dataset/client combo are scheduled to run during the first couple of days of every month, and they become more expensive over time as new tables are added to each dataset.
A much more efficient approach would be to overwrite the `latest.*` tables from the Dataflow pipeline at the moment the tables for each release are created. Rather than updating the deprecated Java pipeline, this should be added as a feature to #79.
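A rough sketch of what that pipeline-side step might look like, using the `google-cloud-bigquery` client to copy a freshly written release table over its `latest` counterpart. This is an assumption about how #79 could implement it, not the actual implementation; the helper names are illustrative. A table copy scans zero bytes, unlike the scheduled query.

```python
def latest_table_id(dataset, client):
    """Destination ID for a dataset/client combo,
    e.g. 'httparchive.latest.response_bodies_mobile'."""
    return f"httparchive.latest.{dataset}_{client}"


def publish_latest(dataset, client, release_date):
    """Copy the newly created release table over the latest.* table.

    WRITE_TRUNCATE replaces the previous release's contents, so the
    latest table always reflects the most recent crawl.
    """
    # Imported here so the pure helper above has no cloud dependency.
    from google.cloud import bigquery

    bq = bigquery.Client()
    source = f"httparchive.{dataset}.{release_date}_{client}"
    job = bq.copy_table(
        source,
        latest_table_id(dataset, client),
        job_config=bigquery.CopyJobConfig(
            write_disposition="WRITE_TRUNCATE"),
    )
    job.result()  # block until the copy completes
```

The pipeline would call `publish_latest(dataset, client, release_date)` once per dataset/client combo after the release tables are written, replacing the scheduled queries entirely.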