[1.19.0] While upgrading, --distributor.shard-by-all-labels is now required on non-related components.

Open EpiJunkie opened this issue 7 months ago • 2 comments

Describe the bug

While upgrading from 1.18.1 to 1.19.0, some components required me to add the --distributor.shard-by-all-labels=true argument. Without it they would not start, failing with a "failed to load runtime config" error.

Components:

  • alertmanager
  • compactor
  • overrides-exporter
  • query-frontend
  • store-gateway

We do have ingester.max-global-series-per-user set.

To Reproduce

Steps to reproduce the behavior:

  1. Set distributor.shard-by-all-labels=true in the configuration on the distributor and querier components (default is false).
  2. Set the ingester.max-global-series-per-user limit within the tenant overrides.
  3. Change the Cortex image from 1.18.1 to 1.19.0.
  4. Wait for restart of component and observe failure. See log excerpts below.

Expected behavior

I would expect only the distributor and querier to require this configuration, as the docs state and as was the case in previous versions.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet

Additional Context

While looking at the diff between the versions, this change to pkg/cortex/runtime_config.go seemed relevant (or link to commit directly).

Log excerpts for each component:

alertmanager:

alertmanager ts=2025-05-07T22:52:04.250387661Z caller=cortex.go:451 level=error msg="module failed" module=runtime-config err="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"
alertmanager ts=2025-05-07T22:52:04.250437925Z caller=cortex.go:451 level=error msg="module failed" module=memberlist-kv err="failed to start memberlist-kv, because it depends on module server, which has failed: invalid service state: Stopping, expected: Running"
alertmanager ts=2025-05-07T22:52:04.250453482Z caller=cortex.go:451 level=error msg="module failed" module=alertmanager err="failed to start alertmanager, because it depends on module server, which has failed: invalid service state: Stopping, expected: Running"

compactor:

compactor ts=2025-05-07T21:07:05.974944401Z caller=cortex.go:451 level=error msg="module failed" module=runtime-config err="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"
compactor ts=2025-05-07T21:07:05.974970033Z caller=cortex.go:451 level=error msg="module failed" module=compactor err="failed to start compactor, because it depends on module runtime-config, which has failed: invalid service state: Failed, expected: Running, failure: invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"

overrides-exporter:

overrides-exporter ts=2025-05-07T23:46:00.915248501Z caller=cortex.go:451 level=error msg="module failed" module=runtime-config err="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"

query-frontend:

query-frontend ts=2025-05-07T23:26:24.092666571Z caller=cortex.go:451 level=error msg="module failed" module=runtime-config err="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"
query-frontend ts=2025-05-07T23:26:24.09277482Z caller=cortex.go:451 level=error msg="module failed" module=query-frontend-tripperware err="failed to start query-frontend-tripperware, because it depends on module runtime-config, which has failed: invalid service state: Failed, expected: Running, failure: invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"
query-frontend ts=2025-05-07T23:26:24.092791764Z caller=cortex.go:451 level=error msg="module failed" module=query-frontend err="failed to start query-frontend, because it depends on module query-frontend-tripperware, which has failed: context canceled"

store-gateway:

store-gateway ts=2025-05-07T22:40:36.588298103Z caller=cortex.go:451 level=error msg="module failed" module=runtime-config err="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled"
store-gateway ts=2025-05-07T22:40:36.588335863Z caller=cortex.go:451 level=error msg="module failed" module=store-gateway err="failed to start store-gateway, because it depends on module memberlist-kv, which has failed: context canceled"

EpiJunkie avatar May 13 '25 14:05 EpiJunkie

The flag should never be disabled. We need to try https://github.com/cortexproject/cortex/issues/6021 again.

friedrichg avatar May 14 '25 11:05 friedrichg

It seems to be due to https://github.com/cortexproject/cortex/pull/6340/files#diff-f70ef13978fead446903645dc3a53f599c9986caef0ee88bd079e46f09231f53R81-R86, but your distributor.shard-by-all-labels value is true, right?

If you set it to true only for the distributor and querier, other components would fail at 1.19.0.
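To illustrate, here is a minimal sketch of how such a failure can arise, assuming the runtime-config loader validates each tenant override against the flag value of the local process. The names `tenantLimits` and `validateOverrides` are hypothetical and are not taken from the Cortex source; only the error text is copied from the logs above.

```go
package main

import (
	"errors"
	"fmt"
)

// tenantLimits is a hypothetical stand-in for one tenant's overrides.
type tenantLimits struct {
	MaxGlobalSeriesPerUser int
}

// Error text copied from the log excerpts in this issue.
var errGlobalLimitUnsupported = errors.New("The ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled")

// validateOverrides (hypothetical) rejects the whole overrides file when any
// tenant sets a global series limit while the local process runs with
// shard-by-all-labels disabled. It does not look at which component is running.
func validateOverrides(overrides map[string]tenantLimits, shardByAllLabels bool) error {
	for _, limits := range overrides {
		if limits.MaxGlobalSeriesPerUser > 0 && !shardByAllLabels {
			return errGlobalLimitUnsupported
		}
	}
	return nil
}

func main() {
	overrides := map[string]tenantLimits{
		"tenant-a": {MaxGlobalSeriesPerUser: 100000},
	}

	// A component started without the flag sees the default (false) and fails.
	fmt.Println("flag unset:", validateOverrides(overrides, false))

	// A component started with the flag set to true loads the same file fine.
	fmt.Println("flag set:  ", validateOverrides(overrides, true))
}
```

Under that assumption, every component that loads the runtime config needs the flag, regardless of whether it ever shards series.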

SungJin1212 avatar May 16 '25 09:05 SungJin1212

@SungJin1212 Nice find. This does indeed seem to be a behavior change introduced by that code path.

Help wanted. I think we can fix this by checking which Cortex target is enabled to decide whether to perform the check.
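A rough sketch of that idea, not the actual patch: the validation is skipped unless the process runs a target that uses the sharding flag. The function names and the set of targets (`all`, `distributor`, `querier`) are assumptions made for illustration; the real list would need to be confirmed against the code.

```go
package main

import (
	"errors"
	"fmt"
)

// targetsUsingShardFlag is an assumed set of targets for which the
// shard-by-all-labels flag is meaningful. It is illustrative only.
var targetsUsingShardFlag = map[string]bool{
	"all":         true,
	"distributor": true,
	"querier":     true,
}

// shouldValidateShardFlag reports whether any enabled target uses the flag.
func shouldValidateShardFlag(enabledTargets []string) bool {
	for _, target := range enabledTargets {
		if targetsUsingShardFlag[target] {
			return true
		}
	}
	return false
}

// validateGlobalSeriesLimit (hypothetical) applies the check only when the
// enabled targets make it relevant, so unrelated components can start.
func validateGlobalSeriesLimit(enabledTargets []string, shardByAllLabels bool, maxGlobalSeriesPerUser int) error {
	if !shouldValidateShardFlag(enabledTargets) {
		return nil
	}
	if maxGlobalSeriesPerUser > 0 && !shardByAllLabels {
		return errors.New("the ingester.max-global-series-per-user limit is unsupported if distributor.shard-by-all-labels is disabled")
	}
	return nil
}

func main() {
	// The compactor does not use the flag, so the check is skipped.
	fmt.Println("compactor:  ", validateGlobalSeriesLimit([]string{"compactor"}, false, 100000))

	// The distributor does use it, so a disabled flag is still rejected.
	fmt.Println("distributor:", validateGlobalSeriesLimit([]string{"distributor"}, false, 100000))
}
```

With a gate like this, components such as the compactor or store-gateway would load the same overrides file without needing the flag, while a misconfigured distributor or querier would still fail fast.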

yeya24 avatar Jul 13 '25 19:07 yeya24

@yeya24 I will fix it.

SungJin1212 avatar Jul 14 '25 02:07 SungJin1212