Prometheus in Istio Mesh cannot scrape metrics when PeerAuthentication MTLS mode is Permissive
Bug Description
Let me start by saying I am not sure if it is more appropriate to file this issue with Prometheus, or with Istio. If this is better suited for Prometheus, please close the issue.
Given a Prometheus instance inside the mesh, when the scrape target is also inside the mesh, and the default mesh PeerAuthentication mTLS mode is set to PERMISSIVE, then Prometheus cannot scrape metrics from those endpoints and receives generic TLS errors (EOF).
I believe this is because mTLS PERMISSIVE mode adds an ALPN filter to the Envoy config, and Istio expects to see ALPN information in the TLS handshake. The ALPN value is set automatically when traffic flows between Istio sidecars, but the Prometheus client doesn't set it: it bypasses its own Envoy sidecar and speaks TLS to the target directly.
I can work around this by setting the default mesh mTLS mode to STRICT and creating workload-specific PeerAuthentication policies, but the desired outcome would be to set PERMISSIVE cluster-wide and not worry about individual applications.
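For reference, the workaround described above looks roughly like the following (resource names, namespace, and labels are hypothetical; the API group and fields follow the standard Istio PeerAuthentication resource):

```yaml
# Mesh-wide default: STRICT mTLS, applied in the Istio root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Hypothetical per-workload exception for an app that must accept plain text.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: my-app-permissive   # hypothetical name
  namespace: my-namespace   # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: my-app           # hypothetical label
  mtls:
    mode: PERMISSIVE
```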
Version
istioctl/istio: 1.11.4
kubectl/k8s: 1.22
Additional Information
No response
Take a look at https://istio.io/latest/docs/ops/integrations/prometheus/ if you haven't already. We are working on a more long-term fix for this, but it's a ways off (https://docs.google.com/document/d/1NAccj8WyjBXOUsMdOHW9sWW6PeCnrIWpc3muBHd4Cs8/edit#heading=h.xw1gqgyqs5b)
As a workaround, I have Prometheus configured to use STRICT mTLS mode according to the TLS settings defined in the integrations doc you linked. Apps that are required to be outside the mesh, but still need to communicate with in-mesh services, have workload-specific PeerAuthentication policies that set the mTLS mode to PERMISSIVE for those in-mesh services.
Where it breaks down is when I change the cluster-wide mesh mTLS mode to PERMISSIVE: then Prometheus (which should still be using mTLS) fails and starts getting EOF errors when trying to complete TLS handshakes.
🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2022-05-04. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.
Created by the issue and PR lifecycle manager.
The issue is still relevant: a service that receives traffic in PERMISSIVE mode cannot be scraped by a job with tls_config.
Same here. When I have STRICT mode on, with my ServiceMonitor configured with the below, it scrapes fine:

```yaml
scheme: https
tls_config:
  ca_file: /etc/prom-certs/root-cert.pem
  cert_file: /etc/prom-certs/cert-chain.pem
  key_file: /etc/prom-certs/key.pem
  insecure_skip_verify: true  # Prometheus does not support Istio security naming, thus skip verifying target pod certificate
```
However, if I remove my PeerAuthentication CR to change it back to PERMISSIVE mode, I get this error:

```
Get "https://XXXXXX/metrics": read tcp XXXXXX->XXXXXX: read: connection reset by peer
```
This is working as expected. Would welcome an update on https://istio.io/latest/docs/ops/integrations/prometheus/#tls-settings to make this clear
same here as @andehpants
When we have STRICT mode with tls_config, it works fine. But with PERMISSIVE mode, we get this error:

```
read tcp XXXXXX->XXXXXX: read: connection reset by peer
```
@howardjohn we have the same config you've shared in the doc: https://istio.io/latest/docs/ops/integrations/prometheus/#tls-settings but it's still not working with PERMISSIVE mode. Can you share any other config to be added/updated?
What needs to be updated is the doc, to say you cannot use PERMISSIVE mode; it will not work. Istio identifies "Istio mTLS" by an ALPN value, and Prometheus cannot set that.
The specific field that causes the problem in PERMISSIVE mode is `scheme: https`. If you remove it, or set it to `http` instead, then scraping works.
But when you enable STRICT mode, it will break again until you set `scheme: https` back.
However, I cannot explain why it won't work with `scheme: https` in PERMISSIVE mode.
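Putting the behavior described above together, the two working configurations would look roughly like this sketch (job names hypothetical; the STRICT-mode TLS paths are the ones from the Istio integration doc linked earlier):

```yaml
scrape_configs:
  # PERMISSIVE mesh: scrape over plain HTTP. The listener accepts
  # plaintext, but rejects plain TLS that lacks the "istio" ALPN value.
  - job_name: in-mesh-app-permissive   # hypothetical job name
    scheme: http

  # STRICT mesh: scrape over mTLS using the sidecar certificates.
  - job_name: in-mesh-app-strict       # hypothetical job name
    scheme: https
    tls_config:
      ca_file: /etc/prom-certs/root-cert.pem
      cert_file: /etc/prom-certs/cert-chain.pem
      key_file: /etc/prom-certs/key.pem
      insecure_skip_verify: true
```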
@howardjohn Do you mind clarifying this please?