kube-state-metrics icon indicating copy to clipboard operation
kube-state-metrics copied to clipboard

fix: fallback to `gauge` for `protofmt`-based negotiations

Open rexagod opened this issue 1 year ago • 20 comments

What this PR does / why we need it: Fallback to gauge metric type when the negotiated content-type is protofmt-based, since Prometheus' protobuf machinery does not recognize all OpenMetrics types (info and statesets in this context).

How does this change affect the cardinality of KSM: None.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #2248

rexagod avatar Dec 13 '23 07:12 rexagod

Need to test this.

rexagod avatar Dec 13 '23 07:12 rexagod

Verified using the following configurations. Error logs were reported without this patch, but none when this was applied.

Details

Prometheus error log

ts=2023-12-13T16:01:02.547Z caller=scrape.go:1418 level=debug component="scrape manager" scrape_pool=kube-state-metrics target=http://127.0.0.1:8080/metrics msg="Append failed" err="invalid metric type \"info\""

KSM CRS Configuration

	  ...
      metrics:
        - name: "foo_info"
          help: "foo_help"
          each:
            type: Info
            info:
              path: [metadata]
              labelsFromPath:
                "name": [name]

Prometheus Configuration

global:
  evaluation_interval: 15s
  scrape_interval:     15s
  scrape_protocols: [PrometheusProto]
scrape_configs:
  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['127.0.0.1:8080']

Generated CRS Metric (from Prometheus UI)

kube_customresource_foo_info{customresource_group="contoso.com", customresource_kind="MyPlatform", customresource_version="v1alpha1", instance="127.0.0.1:8080", job="kube-state-metrics", name="test-dotnet-app"} 1

rexagod avatar Dec 13 '23 16:12 rexagod

/triage accepted /assign @dgrisonnet

logicalhan avatar Dec 14 '23 17:12 logicalhan

How about a different approach.

The current storage is built based on the assumption that the returned types never changes (which used to hold true until recently). Now that we support both the Prometheus text format and the OpenMetrics format, this isn't valid anymore.

The current approach you have is expensive because we do a search in the header of each metric to replace the type.

How about we just don't add it to the generated header stored in the cache anymore (from https://github.com/kubernetes/kube-state-metrics/blob/0bc4b19ac5cf7039fa877a43a85646367f3d63a4/pkg/metric_generator/generator.go#L93), and rather add it close to when the format is negotiated? Maybe in a new function before SanitizeHeader. We would then just need to surface the metrics family type there somehow and then make a decision on the type added to the header depending on the format that was negociated.

dgrisonnet avatar Dec 14 '23 18:12 dgrisonnet

A couple of notes around the latest patch.

  • The earlier approach used strings.Has* and strings.LastIndex functions, with respective complexities of O(1) and O(n). This patch removes the dependency on the latter so the whole operation is lazily run in constant time.
  • Manipulate headers only when they are not duplicates (since they will be removed before any operations).

Benchmarks after the first optimization (comparing the first commit v/s second one on this PR)

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │   PUSHED    │               STAGED                │
                                                  │   sec/op    │   sec/op     vs base                │
SanitizeHeaders/text-format_unique_headers-10       561.4µ ± 0%   561.2µ ± 0%        ~ (p=0.811 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    529.1µ ± 0%   281.6µ ± 0%  -46.77% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10      4.047m ± 0%   3.985m ± 0%   -1.54% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   591.4µ ± 0%   279.1µ ± 0%  -52.81% (p=0.000 n=10)
geomean                                             918.2µ        647.5µ       -29.48%

Benchmarks after both optimizations (comparing the first commit v/s second one on this PR)

goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │    PUSHED    │                 STAGED                 │
                                                  │    sec/op    │    sec/op      vs base                 │
SanitizeHeaders/text-format_unique_headers-10       561.39µ ± 0%    31.79µ ±  0%   -94.34% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    529.06µ ± 0%    31.82µ ±  0%   -93.98% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       4.047m ± 0%   34.068m ± 65%  +741.84% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   591.35µ ± 0%    96.74µ ± 33%   -83.64% (p=0.000 n=10)
geomean                                              918.2µ         240.3µ         -73.83%

How about we just don't add it to the generated header [...]

I'm doing something similar here, where instead of removing the multi-length type substrings, we append their enum values instead (unit-length). This means we won't need to search for the last " " anymore to trim that off, while still retaining the type data within the header required to convert the header back to the expected representation. In addition to the optimizations introduced, this makes the operation execute in constant-time.

However, either way (removing the type entirely, or appending a known-length suffix) we see that the tests need to be reformatted dramatically (lots of repeated copy-pasting), so I was wondering I'll hold off on doing that until we have an approach that looks good after reviews (either the one that's currently in this patch, or any other that may be suggested).

EDIT: Additionally, I've removed the extraneous type definitions in the CRS package and made metrics the single source of truth. This should also help avoid encounters between "Info" and "info", etc. No user-facing features were changed.

rexagod avatar Dec 15 '23 12:12 rexagod

Posting baseline benchmarks, forgot to do this earlier.

  • Baseline (same as main) v/s changes in this PR.
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │   BASELINE   │                   HEAD                    │
                                                  │    sec/op    │     sec/op      vs base                   │
SanitizeHeaders/text-format_unique_headers-10       251.62µ ± 0%     31.17µ ±  0%     -87.61% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    248.52µ ± 4%     31.22µ ±  0%     -87.44% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       250.5µ ± 0%   34093.4µ ± 49%  +13509.75% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10   247.63µ ± 1%     95.04µ ± 41%     -61.62% (p=0.000 n=10)
geomean                                              249.6µ          237.0µ            -5.05%
  • main + skip empty headers v/s changes in this PR (main outperforms this PR when protobuf sanitization is the only thing different in this PR).
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │   BASELINE   │                   HEAD                    │
                                                  │    sec/op    │     sec/op      vs base                   │
SanitizeHeaders/text-format_unique_headers-10       249.42µ ± 0%     31.14µ ±  0%     -87.51% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10     42.73µ ± 0%     31.12µ ±  0%     -27.16% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10       251.1µ ± 0%   36649.4µ ± 64%  +14492.68% (p=0.000 n=10)
SanitizeHeaders/proto-format_duplicate_headers-10    42.74µ ± 0%     93.45µ ± 38%    +118.64% (p=0.000 n=10)
geomean                                              103.4µ          240.0µ          +132.09%

To summarize, we optimized the flow with the first one quite a bit (to the point where it negated the 1.3x more impact on SanitizeHeaders, and gave us a 0.05x boost), but if both main and this PR have that in common, with the only feature differentiating this PR being the proto sanitization, that impacts the performance compared to the baseline.

Please note that these are the stats compared to the baseline, and not the regex implementation (that's present here).

rexagod avatar Dec 15 '23 13:12 rexagod

Added a couple of O(1) checks from the first commit back in. The benchmarks may differ slightly, but the overall inference should be the same.

rexagod avatar Dec 15 '23 14:12 rexagod

Thanks for running those benchmarks, this is super helpful to get confidence.

mrueg avatar Dec 18 '23 10:12 mrueg

Good for reviews.

rexagod avatar Dec 19 '23 14:12 rexagod

/hold

Need to add a test too, will push shortly.

rexagod avatar Dec 20 '23 00:12 rexagod

/unhold

rexagod avatar Dec 20 '23 01:12 rexagod

/hold (for @rexagod to unhold)

/lgtm

Changeset looks good, did you test this with your local setup that scrapes with protobuf?

mrueg avatar Dec 20 '23 10:12 mrueg

Pasting the requested information below.

Testing locally on a Prometheus Protobuf setup. Screenshot 2023-12-20 at 17 05 11 Screenshot 2023-12-20 at 17 07 14
Benchmarking this PR against `main`.
goos: darwin
goarch: arm64
pkg: k8s.io/kube-state-metrics/v2/pkg/metrics_store
                                                  │    BASELINE     │                 HEAD                  │
                                                  │     sec/op      │    sec/op     vs base                 │
SanitizeHeaders/text-format_unique_headers-10           255.0µ ± 3%   1225.8µ ± 3%  +380.70% (p=0.000 n=10)
SanitizeHeaders/text-format_duplicate_headers-10    258262.50n ± 4%    10.87n ± 0%  -100.00% (p=0.000 n=10)
SanitizeHeaders/proto-format_unique_headers-10                         2.071m ± 0%
SanitizeHeaders/proto-format_duplicate_headers-10                      18.50n ± 1%
geomean                                                 256.6µ         4.754µ        -98.58%

rexagod avatar Dec 20 '23 12:12 rexagod

/cc @dgrisonnet

(for another review before merging)

rexagod avatar Dec 20 '23 12:12 rexagod

@dgrisonnet I believe all concerns have been addressed, can you PTAL once more (and LGTM if this looks good to you)?

rexagod avatar Jan 25 '24 19:01 rexagod

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

schahal avatar Feb 05 '24 23:02 schahal

Yes @schahal

dgrisonnet avatar Feb 07 '24 13:02 dgrisonnet

@rexagod didn't we discuss potentially removing OpenMetrics format support?

dgrisonnet avatar Feb 07 '24 13:02 dgrisonnet

@dgrisonnet Yes, I believe we talked about dropping metric-standard lock-ins, and implementing a one-fits-all mechanism (probably on top of gauges) that allows us to generate an exposition format that is ingestible by most metric backends, if not all.

rexagod avatar Feb 20 '24 07:02 rexagod

(friendly ping @mrueg @dgrisonnet to /lgtm)

rexagod avatar Feb 20 '24 21:02 rexagod

2.11.0 introduced a regression, as metrics include stateset entries which are refused by Prometheus. This worked correctly on 2.10 so I'm stuck until this MR is merged.

remram44 avatar Mar 20 '24 19:03 remram44

/lgtm /approve

dgrisonnet avatar Mar 27 '24 18:03 dgrisonnet

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, mrueg, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [dgrisonnet,mrueg,rexagod]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Mar 27 '24 18:03 k8s-ci-robot

/unhold

dgrisonnet avatar Mar 27 '24 18:03 dgrisonnet

Hey @rexagod ! is this also meant to fix #1973 replicating in 2.110??

Efrat19 avatar Mar 28 '24 09:03 Efrat19

It should, yes.

rexagod avatar Mar 28 '24 11:03 rexagod

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)

LaikaN57 avatar Mar 29 '24 18:03 LaikaN57

Hi folks, is this change targeted to be in the release-2.11 cut (which has the k8s 1.28 support of the client libs)?

@dgrisonnet Looks like this fix got merged in too late for v2.11.0. How do we feel about cutting a new release soon™ to get people unblocked? (Thanks in advance.)

@rexagod bump ^

LaikaN57 avatar Apr 02 '24 00:04 LaikaN57

I'll put https://github.com/kubernetes/kube-state-metrics/pull/2335 up for review today.

rexagod avatar Apr 02 '24 07:04 rexagod

Sharing here for visibility: https://github.com/kubernetes/kube-state-metrics/pull/2335#issuecomment-2032148989.

rexagod avatar Apr 02 '24 14:04 rexagod