external-dns icon indicating copy to clipboard operation
external-dns copied to clipboard

feat(aws): add tls ca provision

Open hjoshi123 opened this issue 9 months ago • 24 comments

Description

Adds TLS CA, TLS Client Cert and key to AWS Provider.

Uses custom http transport to feed to aws config default options if tls config is provided.

Fixes #5026

Checklist

  • [x] Unit tests updated
  • [x] End user documentation updated

hjoshi123 avatar Feb 15 '25 06:02 hjoshi123

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Feb 15 '25 06:02 k8s-ci-robot

Hi @hjoshi123. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Feb 15 '25 06:02 k8s-ci-robot

/retitle feat(aws): add tls ca provision /ok-to-test @hjoshi123 Would you please update documentation explaining users that it exists and how to use it ?

mloiseleur avatar Feb 15 '25 08:02 mloiseleur

@ivankatliarchuk @mloiseleur not sure what is up with the lint job. In my local make go-lint is working without any errors. This is the error in CI. The tests are passing now

jsonschema: "linters-settings" does not validate with "/properties/linters-settings/additionalProperties": additional properties 'maligned' not allowed
jsonschema: "run" does not validate with "/properties/run/additionalProperties": additional properties 'exclude-files' not allowed
Failed executing command with error: the configuration contains invalid elements
, Error: Command failed: /home/runner/golangci-lint-1.63.4-linux-amd64/golangci-lint config verify
jsonschema: "linters-settings" does not validate with "/properties/linters-settings/additionalProperties": additional properties 'maligned' not allowed
jsonschema: "run" does not validate with "/properties/run/additionalProperties": additional properties 'exclude-files' not allowed
Failed executing command with error: the configuration contains invalid elements

    at genericNodeError (node:internal/errors:984:15)
    at wrappedFn (node:internal/errors:5[38](https://github.com/kubernetes-sigs/external-dns/actions/runs/13348358593/job/37281685858?pr=5097#step:7:41):14)
    at ChildProcess.exithandler (node:child_process:422:12)
    at ChildProcess.emit (node:events:519:28)
    at maybeClose (node:internal/child_process:1105:16)
    at ChildProcess._handle.onexit (node:internal/child_process:305:5)
Error: Command failed: /home/runner/golangci-lint-1.63.4-linux-amd64/golangci-lint config verify
jsonschema: "linters-settings" does not validate with "/properties/linters-settings/additionalProperties": additional properties 'maligned' not allowed
jsonschema: "run" does not validate with "/properties/run/additionalProperties": additional properties 'exclude-files' not allowed
Failed executing command with error: the configuration contains invalid elements

hjoshi123 avatar Feb 15 '25 20:02 hjoshi123

Apologies there going to be a lot of text, but I would suggest to extract roots, _ = x509.SystemCertPool() logic into separate pull request, as I could not approve it, and this require approval/review from project owners.

So what most likely should go in documentation in some form as a recomendation

  • The recommended approach is not to package certicates with the docker/code/system unless it's a close-source software.
  • In scope of work, the open source service external-dns is going to make a call to a hyperscaler via a proxy, and external-dns know nothing about certicate that is required for proxy, so in theory it should not be available on system level. I made this up, not sure how to explain correctly.
  • To alligh with Industry standards is to issue a short lived intermediate certificates and mount them in the same mount or split them by TTL lifetime
  • ^ one could re-package application, if there is a requirement to package application and certificate instead of mounting them at runtime
  • It very nice to support multiple mount point location for certificates, for cases with multiple issuers, or different TTL, but let's not overcomplicate this, so we could ignore that

ivankatliarchuk avatar Feb 15 '25 21:02 ivankatliarchuk

I would reference here The Adobe Common Controls Framework (CCF), here is a link https://www.adobe.com/trust/compliance/adobe-ccf.html, there are around 4000+ controls, but we only interested in few very relevant

The Go code involving x509.CertPool relates to several key security domains within CCF.

Relevant CCF Controls for x509.CertPool Security

1. Cryptographic Controls & Certificate Management

Related CCF Domains:

  • CCF-CM-1: Use of Strong Cryptographic Standards
  • CCF-CM-2: Certificate and Key Management
  • CCF-CM-3: Trusted Certificate Authorities

How It Relates to the Code:

  • Using x509.SystemCertPool() ensures system-trusted certificates are used, aligning with CCF’s requirement to rely on approved CAs.
  • Using an empty x509.CertPool and adding/mounting CAs aligns with stricter cryptographic controls by allowing explicit trust settings.

Recommendation:

  • Ensure certificates are sourced from a trusted and validated CA.
  • Regularly update and rotate CA certificates to comply with lifecycle management policies.

2. System & Network Security

Related CCF Domains:

  • CCF-NS-3: Secure TLS Configurations
  • CCF-NS-5: Prevent Unauthorized Certificate Authorities

How It Relates to Your Code:

  • If the system CA store is compromised, x509.SystemCertPool() could trust unverified CAs.
  • A custom CertPool prevents unauthorized certificates, improving security.

Recommendation:

  • Use TLS best practices (e.g., enforcing strong ciphers and TLS 1.2+).
  • Monitor system CA stores for unauthorized modifications.
  • Prefer explicit trust models (Option 2) in high-security environments.

3. Error Handling & Secure Coding Practices

Related CCF Domains:

  • CCF-SD-1: Secure Software Development Lifecycle (SDLC)
  • CCF-SD-3: Proper Error Handling

How It Relates to Code:

  • Ignoring errors (_ = x509.SystemCertPool()) is a bad security practice.
  • If x509.SystemCertPool() fails (e.g., Windows lacks support), a nil roots could cause panics or security failures.

Recommendation:

  • Always check for errors when calling x509.SystemCertPool().
  • Implement logging and alerting for failures.
Option Security Implication CCF Compliance Alignment
Option 1 (SystemCertPool()) Trusts system CAs (convenient but can be risky) ✅ Meets CCF-CM-3 (Trusted CAs) but ❌ may not meet CCF-NS-5 (Unauthorized CAs)
Option 2 (Empty CertPool) Only explicitly added CAs are trusted ✅ Stronger security, aligns with CCF-NS-5 and CCF-CM-2

ivankatliarchuk avatar Feb 15 '25 21:02 ivankatliarchuk

@ivankatliarchuk sure I can change it back to start form empty cert pool no issues.. We can make that as part of a separate discussion

hjoshi123 avatar Feb 15 '25 21:02 hjoshi123

@ivankatliarchuk do you have any idea about the ci lint job error?

hjoshi123 avatar Feb 15 '25 21:02 hjoshi123

I'll try to find a solution for linter. We recently added few things, looks like a bug was introduced

ivankatliarchuk avatar Feb 15 '25 21:02 ivankatliarchuk

@ivankatliarchuk I can debug and try to fix it in a different PR if its fine by you?

hjoshi123 avatar Feb 15 '25 21:02 hjoshi123

A fix for this issue appears to be available in another pull request (https://github.com/kubernetes-sigs/external-dns/pull/5085/files). Hopefully, it will be reviewed and approved soon.

ivankatliarchuk avatar Feb 15 '25 21:02 ivankatliarchuk

A fix for this issue appears to be available in another pull request (https://github.com/kubernetes-sigs/external-dns/pull/5085/files). Hopefully, it will be reviewed and approved soon.

Ah I see.. Let's wait for it then.. otherwise does the PR look good to you?

hjoshi123 avatar Feb 15 '25 21:02 hjoshi123

A fix for this issue appears to be available in another pull request (https://github.com/kubernetes-sigs/external-dns/pull/5085/files). Hopefully, it will be reviewed and approved soon.

Ah I see.. Let's wait for it then.. otherwise does the PR look good to you?

Few things to consider, and something like a test on real-cluster/local-cluster will speed up review.

As @mloiseleur pointed out, you need to consider what to add to a documentation, maybe as well provide an example setup with a proxy of your choice in the docs on top of the description/recomendations.

ivankatliarchuk avatar Feb 15 '25 21:02 ivankatliarchuk

A fix for this issue appears to be available in another pull request (https://github.com/kubernetes-sigs/external-dns/pull/5085/files). Hopefully, it will be reviewed and approved soon.

Ah I see.. Let's wait for it then.. otherwise does the PR look good to you?

Few things to consider, and something like a test on real-cluster/local-cluster will speed up review.

As @mloiseleur pointed out, you need to consider what to add to a documentation, maybe as well provide an example setup with a proxy of your choice in the docs on top of the description/recomendations.

Hmm.. I am not sure how to test it in real world since I dont have a proxy working or any real world ca-certs. I can work on the documentation though. My unit test actually tries to mimic the real world scenario that the issue was talking about.. it creates a root CA first and then uses the rootCA to issue child certs and child keys which is then fed to aws config through http transport.. but if you have better suggestions might help. Meanwhile I can check on the documentation

hjoshi123 avatar Feb 15 '25 21:02 hjoshi123

I'll throw some ideas

  • [ ] There is an issue, so worth to ask a requestor to share his setup, he may be able to point out or explain things around how to configure proxy and certificates.
  • [ ] How AWS credentials provided to external-dns when it's expected to run behind a proxy
  • [ ] Instrument a local proxy with certificates in minikube or some other cluster of you choice
  • [ ] Review this code there is an example how to instrument aws-sdk-go-v2 behind a proxy. Not as useful code, but go with proxy example
  • [ ] The unit tests should have a proxy setup and aws stubs something like test -> HTTP proxy -> calls to fake AWS provider

ivankatliarchuk avatar Feb 15 '25 22:02 ivankatliarchuk

@ivankatliarchuk I didnt think it needed to be that detailed since the idea is similar to the rfc2136 provider where similar kind of tests are being done for TLS. @mloiseleur what do you think? Definitely worth asking the requestor if he can share his setup

From the credentials perspective I dont think anything changes since we are just providing TLS certs for the traffic to flow.. the environmental values or the way AWS picks up creds should be same

hjoshi123 avatar Feb 15 '25 22:02 hjoshi123

Np, I think I shared enough information to consider. Could be that I did not spend enough time to understand the original issue, and this is my understanding of client with forward proxy

Screenshot 2025-02-15 at 22 31 44

I found this code for older go-sdk version, that nicely describes go client behind a proxy case https://gist.github.com/jakexks/2f876697dfca1fe15b92f7bb6032780d

ivankatliarchuk avatar Feb 15 '25 22:02 ivankatliarchuk

@ivankatliarchuk that makes sense what you sent to check if the proxy is working and if the request coming from the proxy with the TLS CA Cert is trusted by AWS to generate the config. Imo getting the raw credentials through profile or environmental variables wouldnt change since that is dictated by the actual client the only thing that changes is how the request is made.

In summary, I agree we need a better way to test the use case of proxy, but I feel from a unit test perspective that is beyond the scope of unit test. @mloiseleur What do you suggest?

It would be nice if @cavdhut could give us more insights on how he uses the setup and if the changes made accommodate the use case he spoke about.

hjoshi123 avatar Feb 16 '25 07:02 hjoshi123

@hjoshi123 On RFC2136 provider, there is a test ensuring that all parameters are effectively taken into account, see https://github.com/kubernetes-sigs/external-dns/blob/8a69069c36b156d3e16795d31a3fba6c41204050/provider/rfc2136/rfc2136_test.go#L425-L495

Wdyt of doing the same for aws provider ?

Hmm.. I am not sure how to test it in real world since I dont have a proxy working or any real world ca-certs.

This project is widely used. We cannot merge something that has not been properly tested. For a simple manual test, you can setup locally a small proxy with TLS auth.

mloiseleur avatar Feb 17 '25 08:02 mloiseleur

@mloiseleur I can try the stub provider as its shown in the RFC2136 and push it.. essentially in the unit test, I am doing the same thing, I am creating a CA using that to sign a cert and then using that as TLSConfig... given that part is already there I can try the stub to see if the provider works as required.. will clean up some of the code as you mentioned.. does that work?

hjoshi123 avatar Feb 18 '25 03:02 hjoshi123

sure, it sounds good.

mloiseleur avatar Feb 18 '25 21:02 mloiseleur

Without reproducable setup on local machine with mininkute or kind, this going to be on hold. As need to be proper smoke tested. Access to AWS via proxy and rfc2136 have not much in common.

This is the high-lvel architecture diagram for an infrastructure, the code is not following it

Untitled-2024-05-21-1423 excalidraw(15)

ivankatliarchuk avatar Mar 04 '25 09:03 ivankatliarchuk

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 12 '25 21:03 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 10 '25 22:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jul 10 '25 22:07 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Aug 09 '25 23:08 k8s-triage-robot

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 09 '25 23:08 k8s-ci-robot