Update the "latest" tables from Dataflow

Open rviscomi opened this issue 5 years ago • 0 comments

Forked from #76

Currently we use scheduled queries to scan each dataset/client combo for the latest release and save that to its respective latest.<dataset>_<client> table.

For example, here's the scheduled query that generates the latest.response_bodies_mobile table:

#standardSQL
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  ENDS_WITH(_TABLE_SUFFIX, 'mobile') AND
  SUBSTR(_TABLE_SUFFIX, 0, 10) = (
  SELECT
    SUBSTR(table_id, 0, 10) AS date
  FROM
    `httparchive.response_bodies.__TABLES_SUMMARY__`
  ORDER BY
    table_id DESC
  LIMIT
    1)

BigQuery usually has some heuristics to help minimize the number of bytes processed by a query if the WHERE clause clearly limits the _TABLE_SUFFIX pseudocolumn to a particular table. But I'm not sure if that's happening here because the estimated cost of this query is over $1000 (200 TB): This query will process 202.9 TB when run..

Queries for each dataset/client combo are scheduled to run on the first couple of days of every month. They become more expensive over time as we add new tables to every dataset.

A much more efficient approach would be to overwrite the latest.* tables in the Dataflow pipeline when we create the tables for each release. Rather than updating the deprecated Java pipeline, add this as a feature to #79.

Jun 03 '20 21:06 rviscomi