mimir.rules.kubernetes does not support federated rule groups source_tenants
Request
The Mimir ruler supports federated rule groups via source_tenants; however, this is not supported in the mimir.rules.kubernetes arguments. It could easily be added to the river config, e.g.
mimir.rules.kubernetes "tenant1" {
  address   = "http://mimir-ruler:8080"
  tenant_id = "tenant1"
  source_tenants = [
    "tenant1",
    "tenant2",
  ]
}
Alternatively, and more flexibly, it could be obtained from an annotation on the prometheusrules resource:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant1-group1
  annotations:
    monitoring.grafana.com/source_tenants: tenant1,tenant2
spec:
  groups:
    - ...
Use case
For a tenant to query cluster metrics, e.g. pod CPU/memory usage, they need to be able to query the platform metrics tenant as well as their own.
I've raised https://github.com/grafana/agent/pull/6669 which adds support for the monitoring.grafana.com/source_tenants CRD annotation, as federating every query for a tenant via the river config didn't make sense to me.
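For reference, the kind of rule this unblocks looks roughly like the sketch below; the tenant names, metric and recording rule are purely illustrative and not part of the PR:

# Illustrative only: a rule evaluated as tenant1 that also reads a
# "platform" tenant's cAdvisor metrics via the source_tenants annotation.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant1-cluster-usage
  annotations:
    monitoring.grafana.com/source_tenants: tenant1,platform
spec:
  groups:
    - name: cluster-usage
      rules:
        - record: namespace:container_memory_working_set_bytes:sum
          expr: sum by (namespace) (container_memory_working_set_bytes{namespace="tenant1"})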
I am also very interested in this feature.
I would think it makes more sense for grafana or another party to actually create a new set of CRs for this use case, because PrometheusRules are being used for Loki (without prometheus operator validation). It would probably be "easy" to have something like an AlertingRule and RecordingRule CRD in alloy, similar to what is done in https://github.com/grafana/loki/pull/5985/files#diff-76ed4f38ca6a465f642b0789e620c9df313b0238addf2cc7353a9980011f7674R129, with sourceTenants as a field in the spec instead of an annotation.
In the same way, the rule could have a backend field with a type of loki, prometheus or tempo (once tempo supports rules), or logql, promql and traceql for instance, so it would be easy to build a validating webhook that parses the expression based on the backend type.
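To make that concrete, here is a rough sketch of what such a spec could look like; the API group, version, kind and field names below are entirely hypothetical:

# Hypothetical CRD shape only; nothing like this exists in alloy today.
apiVersion: monitoring.grafana.com/v1alpha1
kind: AlertingRule
metadata:
  name: tenant1-group1
spec:
  backend: prometheus        # or loki, or tempo once it supports rules
  sourceTenants:
    - tenant1
    - tenant2
  groups:
    - name: group1
      rules:
        - alert: HighPodMemory
          expr: container_memory_working_set_bytes > 1e9
          for: 10m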
@rfratto do you think this idea makes sense?
Hmm, I've also just found that the prometheusrules sync doesn't appear to work at all in alloy? If I go to the alloy UI at /component/mimir.rules.kubernetes.rules to try to debug it, I get a blank page on 1.2.0 and 1.3.0 with the following panic:
2024/08/08 07:54:20 http: panic serving 127.0.0.1:37404: runtime error: invalid memory address or nil pointer dereference
goroutine 158762077 [running]:
net/http.(*conn).serve.func1()
/usr/local/go/src/net/http/server.go:1898 +0xbe
panic({0x6d463c0?, 0xd9cbb60?})
/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/grafana/alloy/internal/component/mimir/rules/kubernetes.(*eventProcessor).getMimirState(0x0)
/src/alloy/internal/component/mimir/rules/kubernetes/events.go:240 +0x3f
github.com/grafana/alloy/internal/component/mimir/rules/kubernetes.(*Component).DebugInfo(0xc002d76008)
/src/alloy/internal/component/mimir/rules/kubernetes/debug.go:26 +0x55
github.com/grafana/alloy/internal/runtime/internal/controller.(*BuiltinComponentNode).DebugInfo(0x6c78400?)
/src/alloy/internal/runtime/internal/controller/node_builtin_component.go:420 +0xb2
github.com/grafana/alloy/internal/runtime.(*Runtime).getComponentDetail(0xc00047eff0, {0x92d6918, 0xc002c87688}, 0xc015543ce0, {0xf0?, 0xb8?, 0xc2?, 0x16?})
/src/alloy/internal/runtime/alloy_components.go:128 +0xab4
github.com/grafana/alloy/internal/runtime.(*Runtime).GetComponent(0xc00047eff0, {{0x0, 0x0}, {0xc00c424a9b, 0x1c}}, {0xf8?, 0x29?, 0x38?, 0xc8?})
/src/alloy/internal/runtime/alloy_components.go:37 +0x2b4
github.com/grafana/alloy/internal/web/api.(*AlloyAPI).RegisterRoutes.(*AlloyAPI).getComponentHandler.func3({0x926f570, 0xc015543cc0}, 0xc0040f66c0)
/src/alloy/internal/web/api/api.go:79 +0xa6
net/http.HandlerFunc.ServeHTTP(0x926f510?, {0x926f570?, 0xc015543cc0?}, 0xc0109f7188?)
/usr/local/go/src/net/http/server.go:2166 +0x29
github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP({{0x920bdc0?, 0xc003842490?}}, {0x926f510?, 0xc011175f20?}, 0xc0040f66c0)
/go/pkg/mod/github.com/grafana/[email protected]/util/httputil/compression.go:91 +0x63
github.com/gorilla/mux.(*Router).ServeHTTP(0xc003854300, {0x926f510, 0xc011175f20}, 0xc0062dcd80)
/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:212 +0x1e2
go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux.traceware.ServeHTTP({{0x7ec461a, 0x5}, {0x92353c8, 0xc00386e140}, {0x92695f0, 0xc002867d40}, {0x920c700, 0xc003854300}, 0x85d7400, 0x0, ...}, ...)
/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/[email protected]/mux.go:179 +0xbb2
github.com/gorilla/mux.(*Router).ServeHTTP(0xc003854240, {0x9263770, 0xc0039421c0}, 0xc0062dc000)
/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:212 +0x1e2
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP({{0x920c700?, 0xc003854240?}, 0xc00384c240?}, {0x9263770, 0xc0039421c0}, 0xc0062dc000)
/go/pkg/mod/golang.org/x/[email protected]/http2/h2c/h2c.go:125 +0x697
net/http.serverHandler.ServeHTTP({0xc016c2b6b0?}, {0x9263770?, 0xc0039421c0?}, 0x6?)
/usr/local/go/src/net/http/server.go:3137 +0x8e
net/http.(*conn).serve(0xc01e112d80, {0x9288ba8, 0xc00384fc80})
/usr/local/go/src/net/http/server.go:2039 +0x5e8
created by net/http.(*Server).Serve in goroutine 530
/usr/local/go/src/net/http/server.go:3285 +0x4b4
Also, upgrading from 1.2.0 to 1.3.0 seems to break clustering randomly, making this version unusable for us:
{"ts":"2024-08-08T09:49:12.182927738Z","level":"error","msg":"fatal error: failed to get peers to join at startup - this is likely a configuration error","service":"cluster","err":"static peer discovery: failed to find any valid join addresses: failed to extract host and port: address cluster-0.mimir.svc.cluster.local: missing port in address\nfailed to resolve SRV records: cluster-0.mimir.svc.cluster.local on 192.168.0.10:53: no such host"}
We use a headless service plus a Deployment:
▶ kubectl get services/cluster-0 -o yaml
apiVersion: v1
kind: Service
metadata:
  name: cluster-0
  namespace: mimir
spec:
  clusterIP: None
  clusterIPs:
    - None
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/instance: cluster-0
    app.kubernetes.io/name: grafana-agent
  sessionAffinity: None
  type: ClusterIP
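The "missing port in address" part of the error makes me wonder whether static peer discovery now wants an explicit port on the join address. A hedged sketch of what we might try in the Deployment args; the container name and config path are placeholders, and 8080 is only assumed from the Service above:

# Untested guess, not a confirmed fix: spell out the port on the join address.
containers:
  - name: alloy                      # placeholder container name
    args:
      - run
      - /etc/alloy/config.alloy      # placeholder config path
      - --cluster.enabled=true
      - --cluster.join-addresses=cluster-0.mimir.svc.cluster.local:8080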
I found the solution to the first issue: it seems the RBAC for the prometheusrules CRD had not been added to the cluster for the agents. There didn't seem to be a log line for the access being denied, but adding the RBAC does fix the UI and shows the component as healthy.
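For anyone hitting the same thing, this is roughly the shape of the RBAC needed; the ClusterRole/ClusterRoleBinding names and the ServiceAccount here are just examples from our setup:

# Grants the agents read/watch access to PrometheusRule resources cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy-prometheusrules
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["prometheusrules"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alloy-prometheusrules
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alloy-prometheusrules
subjects:
  - kind: ServiceAccount
    name: cluster-0        # example ServiceAccount used by the agents
    namespace: mimir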
I'll keep looking into the second issue, but as it stands we can't upgrade to 1.3.0 as it causes a lot of failures for agents starting up; not in all cases, but still in several.
To the original issue: this feature has become more immediately important to us, and I'd rather not maintain a fork for an implementation that isn't desirable.
If we can agree on a design, I can implement it. If we want a custom CRD with the source_tenants field on it, then that can be done, though it will be a breaking change: it will probably require creating a new component and deprecating the existing one, and there will be some pain getting existing users to migrate to it.
Alternatively, if the annotation is an acceptable stepping stone for this component, I can rework that implementation for alloy, provided we would merge it?
@QuentinBisson @rfratto what do you think?
I personally think having a custom rule CRD for alloy makes more sense in the long run, and alloy could even generate the AlertingRule CR from the PrometheusRule CR to keep feature parity, for a lot of different reasons (promql/logql validation and semantics, for instance, but also because the prometheus operator brings new features that would not then need to be maintained in alloy as well). But having an intermediate step with the annotation looks like a good middle ground for now, and it could even be translated later to other CRDs if the alloy team thinks it makes sense.
Also, I do not work at Grafana, so I would strongly advise you to wait for one of the alloy maintainers' opinions on this :)