bigquery icon indicating copy to clipboard operation
bigquery copied to clipboard

Making the HTTP2 query cheaper

Open tunetheweb opened this issue 3 years ago • 2 comments

We have a HTTP/2 requests graph which does a look up on the $_protocol field in the requests.payload column. This currently costs 211TB and costs an estimated $1,058 (yes - one thousand bucks!!!) and counting, to run and is re-run every month. Which is quite frankly ridiculous. It also takes forever to run and sometimes times out.

I wanted to add an HTTP/3 graph since it's getting out there but can't justify doubling that cost! While our generous benefactor may be able to absorb that, others can't, and I think we should be setting a better examples here.

If we use the summary_requests table and use the reqHttpVersion, or respHttpVersion (or both!) then the cost plummets to 363GB and or an estimated $1.77!!! And the data looks pretty similar (not exactly the same as requests and summary_requests look to have slight differences in number of rows, but close enough).

However, there is an issue as these fields had bad data for a long time (relevant WPT issue and was only fixed from October 2020. I would prefer to track the growth longer than that and ideally back to 2015 when HTTP/2 was launched.

So we've a few choices:

  1. Fix up the bad data. Ideally we'd join requests to summary_requests and update the bad reqHttpVersion, or respHttpVersion values to the $._protocol field but can't figure out how to do that.
  2. Patch the bad data by saying ori:, us:, od:, me: or : / values are effectively HTTP/2. This isn't always the case and there are a small number of HTTP/1.1 connections which give those values, but it's close enough and a lot easier to run this clean up than option 1 (unless there is a way to join these two tables I'm not seeing?).
  3. Have a hacky SQL (see below) to patch it in the query instead. Seems a bit of a hack.
  4. Add the protocol column to summary_requests table and backfill all the old values. Seems like quite an effort.
  5. Wait until we reorganised the tables like we've talked about.
  6. Leave as is and just implement HTTP/3 query in cheaper manner.

Thoughts?

#standardSQL
SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
  UNIX_DATE(CAST(REPLACE(SUBSTR(_TABLE_SUFFIX, 0, 10), '_', '-') AS DATE)) * 1000 * 60 * 60 * 24 AS timestamp,
  IF(ENDS_WITH(_TABLE_SUFFIX, 'desktop'), 'desktop', 'mobile') AS client,
  ROUND(SUM(IF(respHttpVersion = 'HTTP/2'
               OR respHttpVersion = 'ori' -- bad value that mostly means HTTP/2 (parsed incorrectly from :authority:)
               OR respHttpVersion = 'us:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :status:)
               OR respHttpVersion = 'od:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :method:)
               OR respHttpVersion = 'me:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :scheme:)
               OR respHttpVersion = ': /' -- bad value that mostly means HTTP/2 (parsed incorrectly from :path:)
               OR reqHttpVersion = 'HTTP/2'
               OR reqHttpVersion = 'ori' -- bad value that mostly means HTTP/2 (parsed incorrectly from :authority:)
               OR reqHttpVersion = 'us:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :status:)
               OR reqHttpVersion = 'od:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :method:)
               OR reqHttpVersion = 'me:' -- bad value that mostly means HTTP/2 (parsed incorrectly from :scheme:)
               OR reqHttpVersion = ': /' -- bad value that mostly means HTTP/2 (parsed incorrectly from :path:)
             , 1, 0)) * 100 / COUNT(0), 2) AS percent
FROM
  `httparchive.summary_requests.*`
GROUP BY
  date,
  timestamp,
  client
ORDER BY
  date DESC,
  client

Here's the comparison of what that comes back with compared to the current production site:

date timestamp client percent curr_pct diff
2021_05_01 1.6198E+12 desktop 64.55 64.8 0.25
2021_05_01 1.6198E+12 mobile 64.96 65.3 0.34
2021_04_01 1.6172E+12 desktop 68.46 68.6 0.14
2021_04_01 1.6172E+12 mobile 67.47 67.6 0.13
2021_03_01 1.6146E+12 desktop 68.5 68.6 0.1
2021_03_01 1.6146E+12 mobile 68.15 68.3 0.15
2021_02_01 1.6121E+12 desktop 68.15 68.3 0.15
2021_02_01 1.6121E+12 mobile 68.05 68.2 0.15
2021_01_01 1.6095E+12 desktop 67.19 67.3 0.11
        67.5 67.5
2020_12_01 1.6068E+12 desktop 66.75 66.9 0.15
2020_12_01 1.6068E+12 mobile 67.11 67.3 0.19
2020_11_01 1.6042E+12 desktop 65.95 66.1 0.15
2020_11_01 1.6042E+12 mobile 66.24 66.4 0.16
2020_10_01 1.6015E+12 desktop 65.57 65.7 0.13
2020_10_01 1.6015E+12 mobile 65.46 65.6 0.14
2020_09_01 1.5989E+12 desktop 63.52 64.8 1.28
2020_09_01 1.5989E+12 mobile 65.61 64.9 -0.71
2020_08_01 1.5962E+12 desktop 62.53 63.7 1.17
2020_08_01 1.5962E+12 mobile 65.09 63.8 -1.29
2020_07_01 1.5936E+12 desktop 62.23 64.2 1.97
2020_07_01 1.5936E+12 mobile 64.43 64.2 -0.23
2020_06_01 1.591E+12 desktop 61.46 64.4 2.94
2020_06_01 1.591E+12 mobile 62.34 64.5 2.16
2020_05_01 1.5883E+12 desktop 60.63 63.4 2.77
2020_05_01 1.5883E+12 mobile 61.79 63.8 2.01
2020_04_01 1.5857E+12 desktop 59.6 62.2 2.6
2020_04_01 1.5857E+12 mobile 60.6 62.4 1.8
2020_03_01 1.583E+12 desktop 59.79 62.3 2.51
2020_03_01 1.583E+12 mobile 60.68 62.5 1.82
2020_02_01 1.5805E+12 desktop 60.32 63.5 3.18
2020_02_01 1.5805E+12 mobile 60.91 63.1 2.19
2020_01_01 1.5778E+12 desktop 55.1 59.2 4.1
2020_01_01 1.5778E+12 mobile 55.11 59.3 4.19
2019_12_01 1.5752E+12 desktop 54.37 58.9 4.53
2019_12_01 1.5752E+12 mobile 54.27 58.9 4.63
2019_11_01 1.5726E+12 desktop 47.22 58 10.78
2019_11_01 1.5726E+12 mobile 53.51 58.2 4.69
2019_10_01 1.5699E+12 desktop 52.55 57.1 4.55
2019_10_01 1.5699E+12 mobile 52.43 56.9 4.47
2019_09_01 1.5673E+12 desktop 51.8 56.2 4.4
2019_09_01 1.5673E+12 mobile 53.47 56 2.53
2019_08_01 1.5646E+12 desktop 51.4 55.7 4.3
2019_08_01 1.5646E+12 mobile 55.16 55.5 0.34
2019_07_01 1.5619E+12 desktop 51.81 54.9 3.09
2019_07_01 1.5619E+12 mobile 54.53 54.8 0.27
2019_06_01 1.5593E+12 desktop 50.83 53.8 2.97
2019_06_01 1.5593E+12 mobile 50.21 53.3 3.09
2019_05_01 1.5567E+12 desktop 48.16 53.1 4.94
2019_05_01 1.5567E+12 mobile 47.38 52.6 5.22
2019_04_01 1.5541E+12 desktop 45.57 52.3 6.73
2019_04_01 1.5541E+12 mobile 44.19 52 7.81
2019_03_01 1.5514E+12 desktop 48.49 50.6 2.11
2019_03_01 1.5514E+12 mobile 47.34 50.7 3.36
2019_02_01 1.549E+12 desktop 49.63 49.7 0.07
2019_02_01 1.549E+12 mobile 49.79 49.8 0.01
        48.3 48.3
        48.3 48.3
2018_12_15 1.5448E+12 desktop 32.8 47.8 15
2018_12_15 1.5448E+12 mobile 36.73 48.9 12.17
        49.1 49.1
        48.8 48.8
2018_11_15 1.5422E+12 desktop 46.92 48.4 1.48
2018_11_15 1.5422E+12 mobile 46.87 48.4 1.53
2018_11_01 1.541E+12 desktop 46.27 47.8 1.53
        47.5 47.5
2018_10_15 1.5396E+12 desktop 45.53 46.5 0.97
2018_10_15 1.5396E+12 mobile 45.13 46.2 1.07
2018_10_01 1.5384E+12 desktop 45.89 46 0.11
2018_10_01 1.5384E+12 mobile 45.53 45.5 -0.03
2018_09_15 1.537E+12 desktop 45.66 45.8 0.14
2018_09_15 1.537E+12 mobile 45.19 45.2 0.01
2018_09_01 1.5358E+12 desktop 44.83 45 0.17
2018_09_01 1.5358E+12 mobile 44.6 44.6 0
2018_08_15 1.5343E+12 desktop 44.65 44.8 0.15
        44.9 44.9
2018_08_01 1.5331E+12 desktop 44.26 44.4 0.14
2018_08_01 1.5331E+12 mobile 44.61 44.6 -0.01
2018_07_15 1.5316E+12 desktop 43.77 44 0.23
2018_07_15 1.5316E+12 mobile 44.3 44.3 0
2018_07_01 1.5304E+12 desktop 43.42 43.6 0.18
2018_07_01 1.5304E+12 mobile 41.37 41.6 0.23
2018_06_15 1.529E+12 desktop 38.59 38.8 0.21
2018_06_15 1.529E+12 mobile 40.36 40.6 0.24
2018_06_01 1.5278E+12 desktop 38.17 38.2 0.03
2018_06_01 1.5278E+12 mobile 39.9 40.1 0.2
2018_05_15 1.5263E+12 desktop 38.16 38.3 0.14
2018_05_15 1.5263E+12 mobile 39.56 39.7 0.14
2018_05_01 1.5251E+12 desktop 37.94 38 0.06
2018_05_01 1.5251E+12 mobile 39.21 39.4 0.19
2018_04_15 1.5238E+12 desktop 37.59 37.6 0.01
2018_04_15 1.5238E+12 mobile 39.16 39.4 0.24
        37.1 37.1
        38.7 38.7
2018_03_15 1.5211E+12 desktop 36.67 36.8 0.13
2018_03_15 1.5211E+12 mobile 37.82 38 0.18
2018_03_01 1.5199E+12 desktop 35.9 35.9 0
2018_03_01 1.5199E+12 mobile 37.1 37.3 0.2
2018_02_15 1.5187E+12 desktop 35.46 35.5 0.04
2018_02_15 1.5187E+12 mobile 36.39 36.5 0.11
2018_02_01 1.5174E+12 desktop 35.23 35.3 0.07
2018_02_01 1.5174E+12 mobile 35.98 36.1 0.12
2018_01_15 1.516E+12 desktop 33.9 34 0.1
2018_01_15 1.516E+12 mobile 34.69 34.8 0.11
2018_01_01 1.5148E+12 desktop 33.3 33.7 0.4
2018_01_01 1.5148E+12 mobile 34.3 34.7 0.4
2017_12_15 1.5133E+12 desktop 33 33.4 0.4
2017_12_15 1.5133E+12 mobile 34.03 34.4 0.37
2017_12_01 1.5121E+12 desktop 31.92 32.4 0.48
2017_12_01 1.5121E+12 mobile 32.58 33.1 0.52
2017_11_15 1.5107E+12 desktop 31.39 31.8 0.41
        32.6 32.6
2017_11_01 1.5095E+12 desktop 31.11 31.5 0.39
2017_11_01 1.5095E+12 mobile 31.76 32.4 0.64
2017_10_15 1.508E+12 desktop 30.19 30.6 0.41
2017_10_15 1.508E+12 mobile 31.06 31.5 0.44
2017_10_01 1.5068E+12 desktop 29.89 30.2 0.31
2017_10_01 1.5068E+12 mobile 30.54 31.1 0.56
2017_09_15 1.5054E+12 desktop 28.88 29.2 0.32
2017_09_15 1.5054E+12 mobile 29.43 30 0.57
2017_09_01 1.5042E+12 desktop 28.21 0 -28.21
2017_09_01 1.5042E+12 mobile 29 0.1 -28.9
2017_08_15 1.5028E+12 desktop 27.25 0 -27.25
2017_08_15 1.5028E+12 mobile 28.07 0 -28.07
2017_08_01 1.5015E+12 desktop 26.76 0 -26.76
2017_08_01 1.5015E+12 mobile 27.41 0 -27.41
2017_07_15 1.5001E+12 desktop 26.63 26.5 -0.13
2017_07_15 1.5001E+12 mobile 27.02 27.1 0.08
2017_07_01 1.4989E+12 desktop 26.14 26 -0.14
2017_07_01 1.4989E+12 mobile 26.44 26.5 0.06
2017_06_15 1.4975E+12 desktop 25.29 25.2 -0.09
2017_06_15 1.4975E+12 mobile 25.88 26 0.12
2017_06_01 1.4963E+12 desktop 25.05 25 -0.05
2017_06_01 1.4963E+12 mobile 25.47 25.7 0.23
2017_05_15 1.4948E+12 desktop 25.02 24.9 -0.12
2017_05_15 1.4948E+12 mobile 25.29 25.5 0.21
2017_05_01 1.4936E+12 desktop 24.87 23.9 -0.97
2017_05_01 1.4936E+12 mobile 24.49 23.8 -0.69
2017_04_15 1.4922E+12 desktop 25.12 24.9 -0.22
2017_04_15 1.4922E+12 mobile 25.41 25.2 -0.21
2017_04_01 1.491E+12 desktop 24.55 24.7 0.15
2017_04_01 1.491E+12 mobile 24.69 24.9 0.21
2017_03_15 1.4895E+12 desktop 23.78 24 0.22
2017_03_15 1.4895E+12 mobile 23.69 23.9 0.21
2017_03_01 1.4883E+12 desktop 23.4 23.4 0
2017_03_01 1.4883E+12 mobile 23.3 23.4 0.1
2017_02_15 1.4871E+12 desktop 23.07 23.1 0.03
2017_02_15 1.4871E+12 mobile 22.91 23.1 0.19
2017_02_01 1.4859E+12 desktop 22.74 22.8 0.06
2017_02_01 1.4859E+12 mobile 22.85 22.9 0.05
        22 22
2017_01_15 1.4844E+12 mobile 22 22 0
        21.3 21.3
2017_01_01 1.4832E+12 mobile 21.58 21.6 0.02
2016_12_15 1.4818E+12 desktop 19.68 20.9 1.22
        21.3 21.3
        20.7 20.7
        21.2 21.2
2016_11_15 1.4792E+12 desktop 20.54 20.3 -0.24
2016_11_15 1.4792E+12 mobile 20.55 20.6 0.05
2016_11_01 1.478E+12 desktop 20.25 20.3 0.05
2016_11_01 1.478E+12 mobile 19.91 20 0.09
2016_10_15 1.4765E+12 desktop 18.66 18.6 -0.06
2016_10_15 1.4765E+12 mobile 19.37 19.7 0.33
2016_10_01 1.4753E+12 desktop 18.5 18.7 0.2
2016_10_01 1.4753E+12 mobile 19.32 19.5 0.18
2016_09_15 1.4739E+12 desktop 17.11 17.4 0.29
2016_09_15 1.4739E+12 mobile 17.29 17.5 0.21
2016_09_01 1.4727E+12 desktop 16.45 16.5 0.05
2016_09_01 1.4727E+12 mobile 16.66 16.5 -0.16
2016_08_15 1.4712E+12 desktop 16.49 16.5 0.01
2016_08_15 1.4712E+12 mobile 16.4 16.4 0
2016_08_01 1.47E+12 desktop 16.36 16.4 0.04
        16.2 16.2
2016_07_15 1.4685E+12 desktop 15.9 0 -15.9
        0 0
2016_07_01 1.4673E+12 desktop 15.47 0 -15.47
        0 0
2016_06_15 1.4659E+12 desktop 15.16 0 -15.16
        0 0
2016_06_01 1.4647E+12 desktop 13.72 0 -13.72
        0 0
2016_05_15 1.4633E+12 desktop 13.15 0 -13.15
        0 0
2016_05_01 1.4621E+12 desktop 0 0 0
        0 0
2016_04_15 1.4607E+12 desktop 0 0 0
        0 0
2016_04_01 1.4595E+12 desktop 0 0 0
        0 0
2016_03_15 1.458E+12 desktop 0 0 0
        0 0
2016_03_01 1.4568E+12 desktop 0 0 0
2016_03_01 1.4568E+12 mobile 0 0 0
2016_02_15 1.4555E+12 desktop 0 0 0
2016_02_15 1.4555E+12 mobile 0 0 0
2016_02_01 1.4543E+12 desktop 0 0 0
2016_02_01 1.4543E+12 mobile 0 0 0
2016_01_15 1.4528E+12 desktop 0 0 0
2016_01_15 1.4528E+12 mobile 0 0 0
2016_01_01 1.4516E+12 desktop 0 0 0
2016_01_01 1.4516E+12 mobile 0 0 0

tunetheweb avatar Jun 23 '21 15:06 tunetheweb

It should be ok to rely on the pages dataset for reports, but I think the problem is that our timeseries queries require processing historical data every single time and this dataset is very expensive (like lighthouse). I think the first step to resolving this in the general case is to change our timeseries queries to be more iterative and only process the most recent month and append that to the JSON. We'd also need to handle when the data already exists in the JSON, eg if the script ran twice by accident or we had to backfill bad data in an older crawl.

rviscomi avatar Jun 24 '21 19:06 rviscomi

Proposed a fix for incremental runs in #115

I still think we should consider patching the old data, and then separately also correcting the historic data.

The incorrect data is absolutely meaningless so pointless to keep around. And I dislike having such an expensive query exposed on the website incase anyone attempts to run it for their own reasons.

tunetheweb avatar Jul 03 '21 17:07 tunetheweb