Bug: Query-frontend gossip connection failures with Istio
What is the bug?
We are running Mimir version 2.17.1 on a Kubernetes cluster with Istio enabled. Everything works, but we have observed some unusual traffic related to the query-frontend: other Mimir components are trying to gossip with it. However, https://grafana.com/docs/mimir/latest/configure/configure-hash-rings/ lists the Mimir components that maintain hash rings, and the query-frontend is not one of them.
mimir-compactor-0 compactor ts=2025-09-30T12:20:22.854761262Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:e9c5::2a]:7946 err="digest: write tcp [2a03:1e84:1902:14:9da3::14]:50202->[2a03:1e84:1902:14:e9c5::2a]:7946: write: connection reset by peer"
mimir-ingester-1 ingester ts=2025-09-30T12:20:25.259176423Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:6b73::1c]:7946 err="sending data: write tcp [2a03:1e84:1902:14:ed85::5]:43038->[2a03:1e84:1902:14:6b73::1c]:7946: write: broken pipe"
mimir-compactor-0 compactor ts=2025-09-30T12:20:25.45488119Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:6b73::1c]:7946 err="sending data: write tcp [2a03:1e84:1902:14:9da3::14]:41280->[2a03:1e84:1902:14:6b73::1c]:7946: write: broken pipe"
mimir-querier-87bdc8c8-xhgxj querier ts=2025-09-30T12:20:27.320626871Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:e9c5::2a]:7946 err="digest: write tcp [2a03:1e84:1902:13:f6d0::1a]:60706->[2a03:1e84:1902:14:e9c5::2a]:7946: write: broken pipe"
mimir-querier-87bdc8c8-xhgxj querier ts=2025-09-30T12:20:30.54214228Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:e9c5::2a]:7946 err="sending data: write tcp [2a03:1e84:1902:13:f6d0::1a]:60718->[2a03:1e84:1902:14:e9c5::2a]:7946: write: broken pipe"
mimir-store-gateway-0 store-gateway ts=2025-09-30T12:20:30.450756265Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:6b73::1c]:7946 err="sending data: write tcp [2a03:1e84:1902:14:a2f3::13]:57620->[2a03:1e84:1902:14:6b73::1c]:7946: write: broken pipe"
mimir-querier-87bdc8c8-xhgxj querier ts=2025-09-30T12:20:32.152130435Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:6b73::1c]:7946 err="sending data: write tcp [2a03:1e84:1902:13:f6d0::1a]:43060->[2a03:1e84:1902:14:6b73::1c]:7946: write: broken pipe"
mimir-ruler-9bcbc5b67-qrh6r ruler ts=2025-09-30T12:20:32.667270951Z caller=tcp_transport.go:496 level=warn component="memberlist TCPTransport" msg="WriteTo failed" addr=[2a03:1e84:1902:14:6b73::1c]:7946 err="sending data: write tcp [2a03:1e84:1902:14:6b73::70]:47418->[2a03:1e84:1902:14:6b73::1c]:7946: write: broken pipe"
How to reproduce it?
Deploy mimir version 2.17.1 using the mimir-distributed helm chart version 5.8.0 on a Kubernetes cluster with Istio enabled.
What did you think would happen?
There was a clarification in the Grafana Mimir Slack channel that the query-frontend is indeed involved in gossip but does not maintain its own hash ring. From the Helm chart I have the following observations:
- Mimir components that are part of memberlist define the following port in their pod spec, used for gossip; the query-frontend does not define one.
- name: memberlist
  containerPort: {{ include "mimir.memberlistBindPort" . }}
  protocol: TCP
- Another issue we observed is that the query-frontend does not define the `app.kubernetes.io/part-of: memberlist` label, which the other Mimir components carry and which is used as the selector label on the gossip-ring headless service. Istio blocks traffic to query-frontend:7946 for this reason.
After making the above changes with local overrides, we no longer see the error logs (a sketch of such an override is shown below).
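For illustration, a minimal sketch of what such a local override could look like, written as a strategic-merge patch on the query-frontend Deployment. The Deployment name, container name, and the 7946 port are assumptions based on the chart's defaults, not something taken from this report:

# Hypothetical strategic-merge patch; names and port 7946 are assumed chart defaults.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mimir-query-frontend
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/part-of: memberlist   # lets the gossip-ring headless Service select these pods
    spec:
      containers:
        - name: query-frontend
          ports:
            - name: memberlist
              containerPort: 7946
              protocol: TCP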
What was your environment?
- Mimir Version: 2.17.1
- Deployment Method: mimir-distributed Helm chart version 5.8.0
- Platform: Kubernetes cluster with Istio enabled
- Network: IPv6 enabled
- Scheduler Discovery Mode: DNS (default when the scheduler is enabled)
Any additional context to share?
No response
I was taking a look at fixing this issue but I'm actually confused about why this is happening at all.
It's true that the query-frontend doesn't define labels that allow it to participate in the gossip ring in the helm chart. In jsonnet the query-frontend does define these labels but only when using the "ring" scheduler discovery mode - when using "dns" it does not. This makes me think that the query-frontend shouldn't be part of the gossip ring service that other components use to discover who to gossip with.
I don't understand why other components were trying to connect to gossip to the query-frontend.
@sirishkumar I've been trying for a while to get Mimir (with helm) to work with Istio, would you mind telling me how you managed to achieve that?
@danlucioprada Maybe the comments in https://github.com/grafana/mimir/discussions/4714#discussioncomment-12374461 help. Let me know if you have any other questions.
Correct me if I'm wrong, but theoretically, for Mimir to work with Istio, I need to add appProtocol: tcp to the gRPC services and disable the following in the Helm chart by setting it to false:
query_scheduler:
  enabled: false
This is so that the mimir-query-frontend-headless service is generated. However, interestingly, starting from version 5.8 of the chart, this mimir-query-frontend-headless template is no longer generated by Helm, and I haven't found anything that explains why. So I need to manually add this headless service, right?
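For the appProtocol part mentioned above, here is a rough sketch of what an overridden gRPC Service port could look like; the Service name, selector labels, and the 9095 port are assumptions based on Mimir's defaults rather than something taken from the chart:

# Hypothetical Service override; name, labels, and port are assumed Mimir defaults.
apiVersion: v1
kind: Service
metadata:
  name: mimir-query-frontend
spec:
  selector:
    app.kubernetes.io/name: mimir
    app.kubernetes.io/component: query-frontend
  ports:
    - name: grpc
      port: 9095
      targetPort: grpc
      protocol: TCP
      appProtocol: tcp   # tells Istio to treat this port as raw TCP instead of sniffing it as HTTP/gRPC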
Template folder for version 6.0.3.
@danlucioprada That's right, we deploy using Pulumi, so we create the headless service through Pulumi.
I got it working, but I had to make Istio ignore those two ports. Is there a way to make Mimir work within the mesh without ignoring those ports (gRPC and the gossip protocol)?
annotations:
  traffic.sidecar.istio.io/excludeInboundPorts: "7946,9095"
  traffic.sidecar.istio.io/excludeOutboundPorts: "7946,9095"
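If you prefer to set these through the chart rather than annotating pods directly, a possible sketch of the Helm values, assuming the mimir-distributed chart exposes a per-component podAnnotations field (verify against your chart version):

# Hypothetical values.yaml excerpt; the per-component podAnnotations field is an assumption.
querier:
  podAnnotations:
    traffic.sidecar.istio.io/excludeInboundPorts: "7946,9095"
    traffic.sidecar.istio.io/excludeOutboundPorts: "7946,9095"
# Repeat the same podAnnotations block for the other components (ingester, compactor, ...).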
@danlucioprada This should work out of the box for most services, the exceptions being the ruler and the query-frontend. The reasons being:
- For 7946, the gossip ring port, there is a headless service which allows pod-to-pod communication.
- Similarly for 9095, used for communication between Mimir components, there is a headless service created by the Helm chart, the exceptions being the query-frontend and the ruler.
- For the query-frontend and the ruler, create a headless service explicitly to allow pod-to-pod communication (see the sketch after this list).
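As an illustration, a minimal sketch of such a headless Service for the query-frontend; the name, port, and selector labels are assumptions based on the chart's conventions, not something taken from this thread:

# Hypothetical headless Service for query-frontend pod-to-pod gRPC traffic;
# name, port, and selector labels are assumed chart defaults.
apiVersion: v1
kind: Service
metadata:
  name: mimir-query-frontend-headless
spec:
  clusterIP: None   # headless: DNS resolves directly to the pod IPs
  ports:
    - name: grpc
      port: 9095
      targetPort: grpc
      protocol: TCP
      appProtocol: tcp
  selector:
    app.kubernetes.io/name: mimir
    app.kubernetes.io/component: query-frontend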
The screenshot below shows the different headless services in our setup.
One more thing related to authorization policies: if you have an allow-nothing authorization policy enabled, you need to create an Istio AuthorizationPolicy that allows communication between the Mimir components within your namespace. The same applies to Loki.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-mimir-api-access
  namespace: metrics
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces:
              - metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: mimir