
TLS Certificate Verification Fails for Inter-Service Communication When Using Custom Domain Names

Open Pger-Y opened this issue 6 months ago • 4 comments

Describe the bug
When deploying Pixie Cloud with custom domain names (e.g., pixie.domain.com and work-pixie.domain.com instead of the standard work.pixie.domain.com pattern), inter-service gRPC communication fails with TLS certificate verification errors. The API server cannot connect to other services (auth-server, profile-server, artifact-tracker-server, etc.) because the kubernetes gRPC resolver resolves to Pod names, which are not included in the TLS certificates.

To Reproduce
Steps to reproduce the behavior:
1. Deploy Pixie Cloud with a custom domain configuration (sketched below):
   - Set PL_DOMAIN_NAME to pixie.domain.com
   - Set PL_WORK_DOMAIN to work-pixie.domain.com (peer-level domains instead of a subdomain)
2. Complete the deployment and try to access the Pixie UI.
3. Check the API server logs: kubectl logs -n plc deployment/api-server
4. See TLS certificate verification errors.
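For reference, a minimal sketch of the custom-domain settings from step 1 (illustrative only; the exact file and keys depend on your deployment method):

# Peer-level custom domains instead of the standard work.<domain> subdomain
PL_DOMAIN_NAME: "pixie.domain.com"
PL_WORK_DOMAIN: "work-pixie.domain.com"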

Expected behavior
Inter-service gRPC communication should work seamlessly regardless of the domain naming pattern used. Services should be able to connect to each other without TLS certificate verification failures.

Root Cause
The default service_config.yaml uses the kubernetes:/// gRPC resolver format for inter-service connections. This resolver resolves to Pod IP addresses and uses Pod names for TLS verification, but the TLS certificates only contain Service-level domain patterns (*.plc, *.plc.svc.cluster.local).

Workaround/Solution
Modify k8s/cloud/base/service_config.yaml (and the corresponding environment-specific configs) to use the full FQDN format instead of the kubernetes resolver (sketched below). Apply this change to all services in the config and restart the affected deployments.

Suggested Fix
1. Update the default service_config.yaml templates to use the FQDN format by default.
2. Update the documentation to mention this requirement when using custom domain configurations.
3. Consider making the kubernetes resolver handle TLS certificate validation more robustly.

This issue affects deployments with custom domain naming patterns and should be addressed to improve the out-of-the-box experience for users with certificate constraints.
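A hedged before/after sketch of the workaround, assuming an auth-service entry on port 50100 (the key name and port are illustrative; match them against the entries actually present in service_config.yaml):

# Before: kubernetes resolver, which resolves to Pod IPs and verifies TLS against Pod names
PL_AUTH_SERVICE: "kubernetes:///auth-service.plc:50100"
# After: full Service FQDN, matching the *.plc.svc.cluster.local pattern in the certs
PL_AUTH_SERVICE: "auth-service.plc.svc.cluster.local:50100"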

Pger-Y avatar Jun 09 '25 13:06 Pger-Y

Hi @Pger-Y, the PL_DOMAIN_NAME and PL_WORK_DOMAIN settings do not apply to the service-tls-certs used for inter-service communication. To see the TLS errors you are describing, it appears you may have changed those certificates or the hostnames Pixie uses to reach other services in service_config.yaml.

Inter-service communication is already configured properly in an out-of-the-box install, since it is independent of any user-provided settings (i.e. PL_DOMAIN_NAME, PL_WORK_DOMAIN, etc.). Since this should be seamless, and changing these certs would require a migration for all existing users, we would need a strong justification for changing the certificate Common Name or SANs.

We are more than happy to discuss improvements and appreciate you bringing this issue up, but at this time we intend to keep the service-tls-certs setup the same.

ddelnano avatar Jun 10 '25 14:06 ddelnano

I managed to deploy plc and pl in my cluster, but I got an error like this:

[screenshots: error output]

🙏🙏🙏

Pger-Y avatar Jun 23 '25 11:06 Pger-Y

> Hi @Pger-Y, the PL_DOMAIN_NAME and PL_WORK_DOMAIN settings do not apply to the service-tls-certs used for inter-service communication. To see the TLS errors you are describing, it appears you may have changed those certificates or the hostnames Pixie uses to reach other services in service_config.yaml.
>
> Inter-service communication is already configured properly in an out-of-the-box install, since it is independent of any user-provided settings (i.e. PL_DOMAIN_NAME, PL_WORK_DOMAIN, etc.). Since this should be seamless, and changing these certs would require a migration for all existing users, we would need a strong justification for changing the certificate Common Name or SANs.
>
> We are more than happy to discuss improvements and appreciate you bringing this issue up, but at this time we intend to keep the service-tls-certs setup the same.

Yes, I previously modified several configurations to align with our domain setup, which unfortunately led to certificate mismatch errors. :( I've since rolled back to the official guide, but I still have a question about the request flow from the Ingress to cloud-proxy-service. The Ingress is configured with a TLS certificate covering pixie.xxx.com and work.pixie.xxx.com. However, when a request is proxied to the backend service (cloud-proxy-service), it uses the internal Kubernetes Service name (e.g., https://cloud-proxy-service), which doesn't match the certificate's domain. Could this be the reason for the SSL errors I encountered? Or is there something I'm missing? My Ingress manifest:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cloud-ingress-https
  namespace: plc
  annotations:
    # Proxy to the backend Service over TLS rather than plain HTTP
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - pixie.example.com
    - work.pixie.example.com
    secretName: cloud-proxy-tls-certs
  rules:
  - host: pixie.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: cloud-proxy-service
            port:
              number: 443
  - host: work.pixie.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: cloud-proxy-service
            port:
              number: 443
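
If backend certificate verification is in play, ingress-nginx exposes proxy-ssl annotations that control which name it verifies (a minimal sketch; note that ingress-nginx defaults proxy-ssl-verify to "off", so a backend name mismatch should only cause failures when verification is explicitly enabled; the secret here is assumed to carry a ca.crt for the CA that signed the backend certificate):

  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    # CA bundle for verifying the backend certificate (assumption: this secret holds ca.crt)
    nginx.ingress.kubernetes.io/proxy-ssl-secret: "plc/cloud-proxy-tls-certs"
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    # Verify against a name the backend certificate actually contains,
    # rather than the bare Service name cloud-proxy-service
    nginx.ingress.kubernetes.io/proxy-ssl-name: "pixie.example.com"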

Pger-Y avatar Jun 26 '25 07:06 Pger-Y

[screenshot: same error] I hit the same error when the Ingress proxies requests to vzconn-server :( because vzconn-server uses cloud-proxy-tls-certs as its certificate, while the Ingress reaches vzconn by its Service name (vzconn-server), which does not match any name in the certificate.
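
Along the same lines, a hedged sketch for a vzconn Ingress (assuming vzconn-server speaks gRPC over TLS, so the backend protocol would be GRPCS; the names here are illustrative):

  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
    # If vzconn-server presents cloud-proxy-tls-certs, verify against a name
    # that certificate actually contains rather than the Service name
    nginx.ingress.kubernetes.io/proxy-ssl-name: "pixie.example.com"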

Pger-Y avatar Jun 27 '25 09:06 Pger-Y