
Reduce trace costs

pbiggar opened this issue 1 year ago · 1 comment

Our OpenTelemetry provider is raising its prices, so we should reduce how much we use.

Currently, we're using about 1.2B events per month, and the next lowest threshold is 450M.

The events are currently split by service:

  • cloudsql-proxy: 0.11%
  • kubernetes-bwd-nginx: 0.15%
  • kubernetes-bwd-ocaml: 57.03% (1.13B)
  • kubernetes-garbagecollector: 38.02% (376M)
  • kubernetes-metrics: 4.69% (45M)

Within kubernetes-bwd-ocaml, the events are split:

BwdServer   | 608,015,209
QueueWorker | 354,919,048
ApiServer   |  66,742,393
CronChecker |  38,742,278
other       |   5,528,954

Note the numbers don't add up because we had a big month for BwdServer due to an anomaly.

To address this:

  • [ ] use TraceRatio samplers on each service (20% for BwdServer, 20% for QueueWorker, 100% for others); see the sampler sketch after this list
    • [x] write code
    • [x] merge to dark repo
    • [x] backport to classic-dark repo
    • [x] merge & deploy
    • [x] add flags to LaunchDarkly.
      • [x] add flags
      • [x] BwdServer
      • [x] QueueWorker
      • [ ] check it works
    • [x] Reduce plan
  • [x] use Honeycomb sampling for garbagecollector (5% should be fine, I'd be surprised if we ever look at this again)
    • [x] merge change
    • [x] check it worked
  • [x] disable k8s metrics (we get these from Google Cloud anyway)
    • [x] merge change
    • [x] check it worked
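
For the first item, here is a minimal sketch of a trace-ratio sampler using the Python OpenTelemetry SDK, purely for illustration; the real change lives in the dark/classic-dark services, the actual ratio is driven by the LaunchDarkly flags above, and the service name and 0.2 default below are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Placeholder ratio: in our setup this value would come from a LaunchDarkly
# flag (e.g. 0.2 for BwdServer and QueueWorker, 1.0 for the other services).
SAMPLE_RATIO = 0.2

# ParentBased keeps decisions consistent within a trace: root spans are
# sampled at SAMPLE_RATIO, child spans follow their parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(SAMPLE_RATIO))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
tracer = trace.get_tracer("bwdserver")  # placeholder service name

with tracer.start_as_current_span("handle_request"):
    pass  # spans created here are exported for ~20% of traces
```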

Overall, this should reduce us from 1.8B in March to:

  • BwdServer: 121M
  • QueueWorker: 71M
  • ApiServer: 67M
  • CronChecker: 39M
  • kubernetes-bwd-ocaml other: 6M
  • garbagecollector: 18M

Overall, that's around 350M events per month.
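
As a back-of-the-envelope check of those projections (the garbagecollector baseline is approximated from the 38.02% split above):

```python
# Projected monthly events after sampling, from the splits above.
before = {
    "BwdServer": 608_015_209,
    "QueueWorker": 354_919_048,
    "ApiServer": 66_742_393,
    "CronChecker": 38_742_278,
    "kubernetes-bwd-ocaml other": 5_528_954,
    "garbagecollector": 376_000_000,  # approximate, from the 38.02% figure
}
rates = {"BwdServer": 0.20, "QueueWorker": 0.20, "garbagecollector": 0.05}

after = {svc: int(n * rates.get(svc, 1.0)) for svc, n in before.items()}
total = sum(after.values())
print(after, total)  # roughly 320-350M in total, well under the 450M threshold
```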

pbiggar · Apr 04 '23 22:04

Confirmed this is in production and works. Final confirmation is still needed that it does in fact lower telemetry usage.

pbiggar · May 07 '23 01:05