
Reduce trace costs

pbiggar opened this issue 1 year ago · 1 comment

Our OpenTelemetry provider is raising its prices, so we should reduce how much we use.

Currently, we're using about 1.2B events per month, and the next lowest threshold is 450M.

The events are currently split by service:

  • cloudsql-proxy: 0.11%
  • kubernetes-bwd-nginx: 0.15%
  • kubernetes-bwd-ocaml: 57.03% (1.13B)
  • kubernetes-garbagecollector: 38.02% (376M)
  • kubernetes-metrics: 4.69% (45M)

Within kubernetes-bwd-ocaml, the events are split:

BwdServer   | 608,015,209
QueueWorker | 354,919,048
ApiServer   |  66,742,393
CronChecker |  38,742,278
other       |   5,528,954

Note the numbers don't add up because we had a big month for BwdServer due to an anomaly.

To address this:

  • [ ] use TraceRatio samplers on each service (20% for BwdServer, 20% for QueueWorker, 100% for others); see the sampler sketch after this list
    • [x] write code
    • [x] merge to dark repo
    • [x] backport to classic-dark repo
    • [x] merge & deploy
    • [x] add flags to LaunchDarkly.
      • [x] add flags
      • [x] BwdServer
      • [x] QueueWorker
      • [ ] check it works
    • [x] Reduce plan
  • [x] use Honeycomb sampling for garbagecollector (5% should be fine, I'd be surprised if we ever look at this again)
    • [x] merge change
    • [x] check it worked
  • [x] disable k8s metrics (we get these from Google Cloud anyway)
    • [x] merge change
    • [x] check it worked
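
For the first item, here is a minimal sketch of a trace-ratio sampler using the Python OpenTelemetry SDK, purely for illustration; the real change lives in the dark/classic-dark services, the actual ratio is driven by the LaunchDarkly flags above, and the service name and 0.2 default below are placeholders.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Placeholder ratio: in our setup this value would come from a LaunchDarkly
# flag (e.g. 0.2 for BwdServer and QueueWorker, 1.0 for the other services).
SAMPLE_RATIO = 0.2

# ParentBased keeps decisions consistent within a trace: root spans are
# sampled at SAMPLE_RATIO, child spans follow their parent's decision.
sampler = ParentBased(root=TraceIdRatioBased(SAMPLE_RATIO))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
tracer = trace.get_tracer("bwdserver")  # placeholder service name

with tracer.start_as_current_span("handle_request"):
    pass  # spans created here are exported for ~20% of traces
```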

Overall, this should reduce us from 1.8B in March to:

  • BwdServer: 121M
  • QueueWorker: 71M
  • ApiServer: 67M
  • CronChecker: 39M
  • kubernetes-bwd-ocaml other: 6M
  • garbagecollector: 18M

Overall, that's around 350M events per month.
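
As a back-of-the-envelope check of those projections (the garbagecollector baseline is approximated from the 38.02% split above):

```python
# Projected monthly events after sampling, from the splits above.
before = {
    "BwdServer": 608_015_209,
    "QueueWorker": 354_919_048,
    "ApiServer": 66_742_393,
    "CronChecker": 38_742_278,
    "kubernetes-bwd-ocaml other": 5_528_954,
    "garbagecollector": 376_000_000,  # approximate, from the 38.02% figure
}
rates = {"BwdServer": 0.20, "QueueWorker": 0.20, "garbagecollector": 0.05}

after = {svc: int(n * rates.get(svc, 1.0)) for svc, n in before.items()}
total = sum(after.values())
print(after, total)  # roughly 320-350M in total, well under the 450M threshold
```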

pbiggar · Apr 04 '23 22:04

Confirmed this is in production and works. Final confirmation is still needed that it does in fact lower telemetry usage.

pbiggar · May 07 '23 01:05