
Consistent error "proxy::h2 ping error: broken pipe" from ztunnel containers across cluster

Open jessebye opened this issue 1 year ago • 6 comments

We have rolled out Istio with ambient mode enabled and observed a consistently large volume of this error being logged by the ztunnel containers:

2024-10-25T19:47:50.381780Z	error	proxy::h2	ping error: broken pipe

In just the past 24 hours we observed this error happening 2.29K times (across 40-50 nodes).

Possibly unrelated, but we also discovered this error happening regularly:

2024-10-25T18:09:30.842925Z	error	access	connection complete	src.addr=10.0.81.228:58446 src.workload="pronode-7xxxxxx-xxxx" src.namespace="services" src.identity="spiffe://cluster.local/ns/services/sa/pronode" dst.addr=10.0.x.x:15008 dst.hbone_addr=10.0.x.x:8480 dst.service="portfolios-graphql.services.svc.cluster.local" dst.workload="portfolios-graphql-6xxxxxxx-xxxxx" dst.namespace="services" dst.identity="spiffe://cluster.local/ns/services/sa/portfolios-graphql" direction="outbound" bytes_sent=5226 bytes_recv=19440 duration="1546750ms" error="while closing connection: send: io error: stream closed because of a broken pipe"

We see corresponding 502 errors being logged from our services for these requests. We only began to observe the elevated 502 rate after switching to Ambient mode.

jessebye avatar Oct 25 '24 20:10 jessebye

Thanks for the report. See https://github.com/istio/istio/wiki/Troubleshooting-Istio-Ambient#scenario-ztunnel-logs-hbone-ping-timeouterror-and-ping-timeout

howardjohn avatar Oct 25 '24 20:10 howardjohn

@howardjohn we are on Istio 1.23.2 though, so we shouldn't be seeing those errors? Also note it's not a ping timeout, it's a broken pipe.

jessebye avatar Oct 25 '24 20:10 jessebye

Thanks for the correction. That does seem different, then.

Focusing on the "while closing connection: send: io error: stream closed because of a broken pipe" error since it gives a bit more info: this means that while we were attempting to close the connection, we found it was already closed. This is after ~25 minutes, so it's a long-lived connection.
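For anyone unfamiliar with where a "broken pipe" comes from: it is the OS-level EPIPE you get when writing to (or shutting down) a TCP socket whose peer has already torn the connection down. A minimal standalone Rust sketch of that socket behavior (not ztunnel code; the exact point at which the error appears is OS- and timing-dependent):

```rust
// Minimal illustration (not ztunnel code) of where "broken pipe" comes from:
// writing to a TCP socket after the peer has already torn the connection down
// eventually surfaces as std::io::ErrorKind::BrokenPipe (EPIPE) on the sender.
use std::io::Write;
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // The "peer" accepts the connection and immediately drops it,
    // roughly like a destination pod going away mid-connection.
    thread::spawn(move || {
        let (stream, _) = listener.accept().unwrap();
        drop(stream);
    });

    let mut client = TcpStream::connect(addr)?;
    thread::sleep(Duration::from_millis(100)); // let the peer close first

    // The first write typically still succeeds (the peer's RST hasn't been
    // processed yet); once it has, further writes fail with BrokenPipe.
    let _ = client.write_all(b"first");
    thread::sleep(Duration::from_millis(100));
    match client.write_all(b"second") {
        Err(e) if e.kind() == std::io::ErrorKind::BrokenPipe => {
            eprintln!("send: io error: stream closed because of a broken pipe ({e})");
        }
        other => println!("got {other:?} instead (timing is OS-dependent)"),
    }
    Ok(())
}
```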

Do we have any info on what is going on on the destination side? It seems plausible the destination app or destination ztunnel shut down?

howardjohn avatar Oct 25 '24 20:10 howardjohn

Ok, yes, I think these are long-running requests timing out because the pod went away or something like that, in which case this is probably not a problem for ztunnel. However, I am still wondering about those h2 pings that get a broken pipe error.

jessebye avatar Oct 25 '24 20:10 jessebye

I suspect those have a similar cause. The backend is closing as we try to send a ping?
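For context on where that ping lives: ztunnel keeps its HTTP/2 (HBONE) connections alive with periodic PING frames, and the `proxy::h2` error is logged when sending the PING fails because the underlying connection is already gone. A rough, hypothetical sketch of such a keepalive loop using the `h2` crate over plain TCP (not ztunnel's actual code; the address, interval, and error handling are made up for illustration):

```rust
// Hypothetical keepalive-ping loop with the `h2` crate (not ztunnel's code).
// If the peer tears the connection down, the ping fails and we log an error
// much like ztunnel's "proxy::h2 ping error: broken pipe".
use std::time::Duration;
use h2::client;
use tokio::net::TcpStream;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tcp = TcpStream::connect("127.0.0.1:8080").await?; // placeholder address
    let (_send_request, mut connection) = client::handshake(tcp).await?;

    // Take the ping/pong handle before handing the connection off to a task.
    let mut ping_pong = connection.ping_pong().expect("ping_pong already taken");

    // The connection future must be driven for pings (and streams) to progress.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    loop {
        tokio::time::sleep(Duration::from_secs(10)).await;
        // If the backend has already closed the TCP connection, this is where
        // a broken-pipe style failure surfaces.
        if let Err(e) = ping_pong.ping(h2::Ping::opaque()).await {
            eprintln!("proxy::h2 ping error: {e}");
            break;
        }
    }
    Ok(())
}
```

If that reading is right, each logged error would correspond to one long-lived connection whose remote end went away between pings.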

howardjohn avatar Oct 25 '24 20:10 howardjohn

That would be surprising, given it is happening so frequently. I know we have pods going up and down maybe every few minutes, but not 95x per hour 🤔

jessebye avatar Oct 25 '24 20:10 jessebye

@howardjohn Hi Howard, did you have a chance to look at "while closing connection: send: io error: stream closed because of a broken pipe"? We are seeing a bunch of these errors in ztunnel.

piljaechae avatar May 13 '25 06:05 piljaechae

We are also seeing this error (Istio ambient 1.26.1). It started on Sunday in our staging cluster, at a time when pretty much nothing was going on in that cluster. The start of this error correlates strongly with another error from ztunnel. It happens between various services in the cluster.

http status: 503 Service Unavailable

They all started in the same minute on multiple ztunnel pods. This is the start of that log; before it we had exactly zero 503s.

[screenshot: start of the 503 log entries]

ping error:

[screenshot: ping error log entries]

jan-ludvik avatar Jun 18 '25 14:06 jan-ludvik

~~Looking further, it seems like all services that are reporting 503s now are using the default service account (they don't have any configured). The identity is spiffe://cluster.local/ns/<ns_redacted>/sa/default. These are source identities. The destination identities use the specific service accounts created for those services. Can it be connected somehow?~~ I found this all comes from one service, which is why all the source identities were the default service account (by coincidence it uses the default service account), but I don't believe that is relevant. More likely the service is doing something bad.

jan-ludvik avatar Jun 18 '25 14:06 jan-ludvik

We also get plenty of these in the ztunnel logs. I also checked that they happen even when no container is killed in the whole cluster (yup, I checked the Kubernetes events). For example, here is a 3-hour period during which all the containers were running without any kill; each dot represents a "ping error: broken pipe" error in the ztunnel logs:

[screenshot: ping errors over a 3-hour window]

EKS: v1.34; Istio ambient mode (with Istio ingress gateway and ztunnel, no waypoint): v1.27.3

wosiu avatar Nov 04 '25 21:11 wosiu