troubleshoot icon indicating copy to clipboard operation
troubleshoot copied to clipboard

Preflight checks for Redis fail after TLS is enabled

Open xavpaice opened this issue 3 years ago • 2 comments

Problem to solve

Preflight checks for Redis fail after TLS is enabled on the respective service. Currently, Preflight checks cannot be configured to utilize a CA bundle, certificate, or a private key to validate authentication.

Customer Impact

  • Increased support burden with support tickets related to the failed preflight checks.
  • Some customers refuse to upgrade to a later version due to preflight check failure

Describe possible workarounds

  • Disable TLS for Redis (not feasible)
  • Disable certificate checks entirely

Potential solution

Add collector options to the Redis collectors:

  • require server authentication,
  • passing a custom CA bundle to deal with self signed ones,
  • passing a certificate and a private key for client authentication purpose.

https://pkg.go.dev/github.com/go-redis/redis/v8#Options shows TLS config coming from the https://pkg.go.dev/crypto/tls#Config struct which is used in the Redis collector.

xavpaice avatar Sep 28 '22 07:09 xavpaice

Background

Redis and other systems such as databases can be deployed on 3 ways when it comes to TLS

  • No TLS
    • Connections between clients and server are not encrypted at all
    • Files needed
      • None
  • TLS
    • Connections are encrypted
    • The client verifies that a server's certificate was signed by a known CA.
    • The client verifies that the server indeed created the certificate by encrypting some random text using the server's public key (it's part of the cert), sending the cipher text to the server which the server decrypts and responds with proof.
    • Files needed
      • client: CA bundle (unless we skip verification)
      • server: CA bundle, server cert, server private key
  • mTLS (mutual TLS)
    • (performs the TLS steps above)
    • The server verifies the clients certificate and also ensures that the client can prove ownership through decrypting cipher text sent by the server.
    • Files needed
      • client: CA bundle (unless we skip verification), client cert, client private key
      • server: CA bundle, server cert, server private key

Troubleshoot limitation

When the redis collector connects to a server that is secured using TLS or mTLS, the client will require extra to pass in the necessary TLS input here, else the connections will not succeed. This collector can be part of a Preflight or SupportBundle spec

Available options

  • Extend the spec to allow passing in the optional input parameters required to successfully connect to redis. The optional parameters would be

    • secret containing this information i.e a yaml with these files in PEM format. This will only work when the collector is run within a cluster
    • Parameters to capture all three files as PEM format strings in the yaml manifests. This works in and out of the cluster. Such a spec needs to be kept secure if it were to have private keys. This is out of scope here though

    Sample spec

      - redis:
          collectorName: redis
          uri: rediss://default:password@hostname:6379
          # Optional
          tlsSecrets: my-tls-secrets
          ca: ...
          cert: ...
          key: ...
    
    # TLS input in a k8s secret
    ca: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    cert: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    key: |
      -----BEGIN PRIVATE KEY-----
      ...
      -----END PRIVATE KEY-----
    
  • Only test if the server is reachable when running preflights (full-fledge collector otherwise). This can be achieved by

    • Extending the collector's output to contain a isReachable field similar to isConnected which indicates that the server was reachable i.e a TCP connection could be made
    • Executing a command such as nc -w 3 <host> <port> using the exec collector
    • Adding a generic collector tests for a tcp connection i.e conn, err := net.Dial("tcp", "<host>:<port>").

banjoh avatar Oct 17 '22 13:10 banjoh

The data needed for these could (and likely already is) be added to config items for the application that's getting deployed - so the task for the Troubleshoot project is to simply allow those values to be passed via the yaml spec to the preflight binary so it can run.

xavpaice avatar Oct 18 '22 01:10 xavpaice

Hello, Any news on this issue? Thanks in advance :)

jbuchner-gg avatar Nov 17 '22 18:11 jbuchner-gg

@banjoh the branch you linked moved, could you update the development branch on this issue so the in-flight work is easier to find?

chris-sanders avatar Nov 21 '22 20:11 chris-sanders

@jbuchner-gg the development branch has been updated you'll find it on the right sidebar if you want to see current progress. It is in development currently.

chris-sanders avatar Nov 23 '22 13:11 chris-sanders

Thanks @chris-sanders 🙏 😄 . Such as this ticket had the label "triage", I feared not having changes soon.

jbuchner-gg avatar Nov 23 '22 14:11 jbuchner-gg

Understood we're using 'triage' as an indicator for a short list of issues which are well defined and ready for work. It doesn't necessarily imply a timeframe but that the issue is available and well defined. We tend to work those sooner than later.

chris-sanders avatar Nov 23 '22 14:11 chris-sanders

@jbuchner-gg work addressing this issue is in https://github.com/replicatedhq/troubleshoot/pull/870 PR. Feel free to review as well to see if its in line with your needs.

banjoh avatar Nov 23 '22 18:11 banjoh