Reconfiguring Connect CA to Built-in from Vault doesn't write Certs
Overview of the Issue
I'm currently on Consul 1.13.6, and I am using the `connect/ca/configuration` endpoint to reconfigure from the Vault CA provider to the built-in Consul provider.
I'm setting it such that my config does not include a root certificate or private key, and I have removed the configuration for the paths to them from my servers' configuration.
When I do this and restart Consul, I can hit the `connect/ca/roots` endpoint and see the new root certificate; however, it doesn't look like it's written anywhere, nor is a private key. I am also unable to specify `cert_file` and `key_file` in the hope that, if the files are not there, Consul writes them.
This makes it difficult to be idempotent: I cannot use something like Vault to store these generated values, and Consul fails to start with "no cert_file is specified" or "key_file does not match cert_file".
Reproduction Steps
- Configure Consul with a valid Vault CA provider configuration
- Use the `connect/ca/configuration` endpoint with a basic configuration of:

```json
{
  "Provider": "consul",
  "ForceWithoutCrossSigning": true
}
```

- Remove the `cert_file` and `ca_file` keys
- Restart Consul
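For reference, a minimal sketch of that API call (the file name is hypothetical, and the agent address/token handling are assumptions; `consul connect ca set-config -config-file=...` works equally well):

```shell
# Write the minimal payload from the reproduction steps above
# (ca-config.json is a hypothetical file name).
cat > ca-config.json <<'EOF'
{
  "Provider": "consul",
  "ForceWithoutCrossSigning": true
}
EOF

# Submit it to the local agent (assumes the default HTTP address and that
# an ACL token with operator:write is exported as CONSUL_HTTP_TOKEN):
# curl -X PUT --data @ca-config.json \
#   http://127.0.0.1:8500/v1/connect/ca/configuration
```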
Consul info for both Client and Server
Server info
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 61547a41
    version = 1.13.6
    version_metadata =
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 2
    leader = false
    leader_addr =
    server = true
raft:
    applied_index = 12323864
    commit_index = 0
    fsm_pending = 0
    last_contact = never
    last_log_index = 12333022
    last_log_term = 205821
    last_snapshot_index = 12323864
    last_snapshot_term = 205817
    latest_configuration = [{REDACTED}]
    latest_configuration_index = 0
    num_peers = 0
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 205821
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 101
    max_procs = 2
    os = linux
    version = go1.18.9
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 259
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 130662
    members = 4
    query_queue = 0
    query_time = 50
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 38107
    members = 6
    query_queue = 0
    query_time = 1
```
Operating system and Environment details
EC2 Amazon Linux 2 AWS
Hi @reskin89,
Thank you for reaching out. I hope to better understand what you're hoping to accomplish and anything we might be able to make clearer for someone in the same position in the future (e.g., clarifying docs). With that context, let's dig in!
Background
Consul has two different PKIs:
- Connect PKI: Used for services in the mesh (and also client agents if using auto-config/encrypt)
- Agent TLS PKI: Used for server agents (and client agents if not using auto-config/encrypt)
The `cert_file` and `key_file` are the public certificate and corresponding private key for the agent TLS certificate that the server agent uses to authenticate as a server when communicating with other agents (servers and clients). `ca_file` is the public certificate for the CA that issued `cert_file` and `key_file`. Assuming you have TLS enabled for agent communication (e.g., `verify_outgoing`, `verify_incoming`, and `verify_server_hostname` set to true), I wouldn't expect that server agent to be able to communicate with other agents. I haven't traced the code paths to check whether the particular error messages you observed would be triggered, but it seems reasonable that they might be.
I wouldn't expect this to have anything to do with the completely separate Connect PKI. In other words, I would guess that you would get the same result by skipping reproduction steps 1-2.
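To make the distinction concrete, these are the kinds of agent TLS settings that `cert_file`, `key_file`, and `ca_file` belong to. Nothing in this stanza touches the Connect CA (all paths here are hypothetical examples):

```shell
# Agent TLS settings live in the server agent's configuration files,
# entirely separate from the Connect CA. Paths below are hypothetical.
cat > tls.json <<'EOF'
{
  "ca_file": "/etc/consul.d/tls/consul-agent-ca.pem",
  "cert_file": "/etc/consul.d/tls/dc1-server-consul-0.pem",
  "key_file": "/etc/consul.d/tls/dc1-server-consul-0-key.pem",
  "verify_incoming": true,
  "verify_outgoing": true,
  "verify_server_hostname": true
}
EOF
```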
Clarifications
Out of curiosity, what originally led you to consider switching from the Vault provider to the Consul provider for the Connect CA? And what led you down the path of trying to change `ca_file`, `cert_file`, and `key_file` as part of this?
Maybe there's a better way to accomplish what you're trying to accomplish, or a feature request that comes out of this. There may also be documentation improvements to clarify things here.
@jkirschner-hashicorp Ryan and I work together. The reason for the switch to Consul CA Provider is due to the following issue: https://github.com/hashicorp/consul/issues/11685
Ah, yes - I thought the Github handle looked familiar!
With that context, it sounds like `ca_file`, `cert_file`, and `key_file` are a red herring (because they are unrelated to the Connect CA). Perhaps they were removed in a desire to ensure the Consul CA provider was generating its own public/private key pair rather than using an existing one? The `ca_file`, `cert_file`, and `key_file` should be left as-is.
If I understand correctly, you called the Update CA Configuration API endpoint without specifying `Config.PrivateKey` and `Config.RootCert` in the request payload, so the provider will generate its own public/private key pair. But there's no endpoint (as far as I know) to ask Consul what that private key is, so you have no ability to store the generated key pair elsewhere (e.g., Vault).
What are your thoughts on generating the `PrivateKey` and `RootCert` first, saving them in Vault, and then submitting them in the Update CA Configuration call?
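As a sketch of that generate-first approach, assuming `openssl` 1.1.1+ is available (all file names and the CN below are hypothetical):

```shell
# Generate a P-256 key and a self-signed root certificate with CA extensions.
openssl ecparam -genkey -name prime256v1 -noout -out connect-ca-key.pem
openssl req -new -x509 -key connect-ca-key.pem -out connect-ca.pem \
  -days 1825 -subj "/CN=pri-dc-connect-ca" \
  -addext "basicConstraints=critical,CA:TRUE" \
  -addext "keyUsage=critical,keyCertSign,cRLSign"

# Embed both PEMs in the Update CA Configuration payload.
python3 - <<'EOF'
import json
payload = {
    "Provider": "consul",
    "Config": {
        "PrivateKey": open("connect-ca-key.pem").read(),
        "RootCert": open("connect-ca.pem").read(),
    },
}
with open("ca-config-with-root.json", "w") as f:
    json.dump(payload, f, indent=2)
EOF

# Store both PEMs in Vault first, then apply:
# consul connect ca set-config -config-file=ca-config-with-root.json
```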
> If I understand correctly, you called the Update CA Configuration API endpoint without specifying `Config.PrivateKey` and `Config.RootCert` in the request payload, so the provider will generate its own public/private key pair. But there's no endpoint (as far as I know) to ask Consul what that private key is, so you have no ability to store the generated key pair elsewhere (e.g., Vault).
That's correct.
> What are your thoughts on generating the `PrivateKey` and `RootCert` first, saving them in Vault, and then submitting them in the Update CA Configuration call?
That sounds good, but in this instance we are not using Vault. Would it be possible to write the files locally instead?
That was my next step: to generate my own. It makes it easier to securely store these in the event there's some type of critical failure; that way, once fixed, we don't have to distribute a new public key to all of our clients.
Thank you for the clarification as well! Understanding the other config parameters better is an enormous help.
@jkirschner-hashicorp I tried generating my own certificates. The primary datacenter in my cluster takes the root just fine; however, it seems the secondaries run a verification of the certificate I provide with the crypto/x509 library, which appears to be throwing an error because it's self-signed.
It's attempting to extract an intermediate chain, but there isn't one to extract; we're using a pure root certificate here.
The error I receive when updating the secondary datacenter:

```
rpc error making call: Error updating secondary datacenter CA config: Failed to set the intermediate certificate with the CA provider: could not verify intermediate cert against root: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "serial:<SERIALNUMBER>")
```

After digging through the Consul code, I arrived at the root of these errors in crypto/x509. I'm unsure what it's attempting to verify the certificate against, if not the primary datacenter's root.
It's almost as if the secondary datacenters are expecting an intermediate in that `RootCert` configuration param, and it thinks I'm providing one that's self-signed.
@reskin89: Does the same problem occur in the secondary if you never specify `PrivateKey` and `RootCert` in any datacenter (which causes Consul to generate its own)?
I assume you're specifying the self-signed cert details (`PrivateKey` and `RootCert`) in the primary datacenter's connect CA configuration. Are you setting `PrivateKey` and `RootCert` to anything in the secondary datacenters? (It's not actually clear to me from the Consul CA provider docs whether or not to set those in secondary DCs. We should clarify in the docs.)
Answers to one or both of these questions might help us narrow down what's happening faster.
I am actually setting the private key and root cert in the secondaries. Thinking about it, since they delegate the root responsibility to the primary, should I submit the change to them with no config and let it self-generate? It would then delegate to the primary anyway, even if it generates its own root, correct?
My guess is that you don't need to specify anything for `PrivateKey` and `RootCert` in the secondary datacenters, and that the secondary datacenters will generate a CSR that they ask the primary datacenter to sign (which delegates to the CA provider of the primary DC). I'm not sure, though. You can likely test it faster than I can look into it.
That actually seems like the most logical route. I'll test shortly and report my findings.
Thanks for all the help!
Interestingly, that results in the same error.
More info: I used my old Vault config to reset my development cluster, and the secondaries are still spitting out that error as well.
I will correct myself here.
I was able to revert to my old Vault config in the secondaries after hard reboots of the servers. Something must have been stuck in memory (even after Consul service restarts).
@reskin89: It seems like you've reverted back to the old state (with the Vault CA provider in the primary and secondaries). What happens if you try moving to the Consul CA provider at this point, setting `PrivateKey` and `RootCert` only in the primary DC? (Or what have you tried since?)
Also, this is tangential, but why set `ForceWithoutCrossSigning` to true? That will cause temporary connection failures until service mesh proxies (and client agents if using auto-config/encrypt) get a new cert.
I've done this previously and it results in the same errors in the secondaries.
If cross-signing isn't set during this operation, the API rejects it.

> If cross-signing isn't set during this operation, the API rejects it.
Do you get any relevant log messages when the rejection occurs?
Both the Vault and Consul CA providers support cross-signing, so I'm not sure why `ForceWithoutCrossSigning` would be needed here.
Really sorry for the delay here @jkirschner-hashicorp , been in the weeds with the consul lambda stuff 😄 .
I will test this when I'm able to circle back; I need my environment stable at the moment.
@reskin89: When you do circle back to this, I thought of a reason why the API might reject the cross-sign request. The Vault ACL token that Consul's Vault CA provider uses to ask the PKI Secrets Engine to perform the cross-sign operation needs elevated privileges: https://developer.hashicorp.com/consul/docs/connect/ca/vault#additional-vault-acl-policies-for-sensitive-operations
It's worth checking whether the Vault ACL token configured in the provider has the privileges needed for that cross-sign call to work.
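If that's the cause, the fix is to grant the token the extra capability the linked docs call out for cross-signing. A sketch of that policy fragment (the `connect_root` mount path is an assumption; substitute your configured `RootPKIPath`):

```shell
# Additional Vault policy needed for the cross-sign operation.
# The mount path "connect_root" is hypothetical -- match your RootPKIPath.
cat > ca-cross-sign-policy.hcl <<'EOF'
path "connect_root/root/sign-self-issued" {
  capabilities = ["sudo", "update"]
}
EOF

# Attach it to the token Consul's Vault CA provider uses:
# vault policy write consul-ca-cross-sign ca-cross-sign-policy.hcl
```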
That makes a whole lot of sense when you say that, I'll definitely give that a look. Thank you!!
@jkirschner-hashicorp I've finally circled back to this.
Your recommendation on vault privileges worked, but now I'm in another spot where my secondary datacenters are failing.
I followed this doc: https://developer.hashicorp.com/consul/tutorials/security/tls-encryption-secure
to generate a CA, cert, and key file from my primary datacenter.
I applied them to my config.json for my server/agent config and my primary datacenter is up and running without error.
However, I tried to distribute the CA, cert, and keys to my secondaries and they're in a failing state.
One of my datacenters just repeatedly states that the CA and ACLs are still initializing, and it also repeatedly reports "no cluster leader":
```
➜ ~ consul monitor -log-level debug
2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:12.341Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:12.341Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:12.754Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:62982 latency="181.953µs"
2023-12-18T16:57:12.758Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=127.0.0.1:62990 latency="63.234µs"
2023-12-18T16:57:13.341Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:13.341Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:14.342Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:14.342Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:15.342Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:15.343Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:16.343Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:16.343Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:17.071Z [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: REDACTED:8302
2023-12-18T16:57:17.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:17.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:18.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:18.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:19.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:19.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:20.345Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:20.345Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:21.345Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:21.345Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:22.057Z [ERROR] agent.http: Request error: method=GET url=/v1/health/state/any from=127.0.0.1:62996 error="No cluster leader"
```
My other secondary is repeatedly stating it has a bad TLS certificate (I ran `consul tls cert create -server` for each datacenter and distributed the certificates accordingly), along with "no cluster leader" and seemingly no leader elections.
My new CA config:

```json
{
  "Provider": "consul",
  "Config": {
    "IntermediateCertTTL": "8760h",
    "LeafCertTTL": "72h",
    "PrivateKeyBits": 256,
    "RootCertTTL": "87600h",
    "RotationPeriod": "2160h"
  },
  "State": null,
  "ForceWithoutCrossSigning": false,
  "CreateIndex": 8,
  "ModifyIndex": 15770485
}
```
I've also performed a `consul join -wan` between all of my datacenters, to no avail.
@reskin89 : Were the secondary datacenters (with WAN federation) previously coming up fine? The agent TLS PKI is separate from the service mesh PKI, so I wouldn't expect a change to your service mesh PKI provider to affect anything about agent TLS or how you'd set it up for WAN fed.
(As an aside: I think our documentation could do a better job conveying these topics.)
Yep, all datacenters were in a known working state prior to this update. I updated the CA config after generating the certs per the document mentioned, then I distributed them to all datacenters and restarted Consul.
The primary is just fine, which is what's interesting. What I'm now witnessing is that one of the secondaries finally elected a leader, but it's unable to sign leaf certs, stating it has no root. Yet if I hit `/v1/connect/ca/roots`, I clearly see the previous root from Vault (since some nodes still have leaf certs from it) and the Consul CA primary cert.
```
2023-12-18T17:38:16.115Z [DEBUG] agent: Node info in sync
2023-12-18T17:38:16.124Z [WARN]  agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1
2023-12-18T17:38:17.252Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:49512 latency="106.265µs"
2023-12-18T17:38:17.255Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=127.0.0.1:49520 latency="92.674µs"
2023-12-18T17:38:17.899Z [ERROR] agent.server.rpc: failed to read byte: conn=from=10.142.79.189:41514 error="remote error: tls: bad certificate"
```
The other secondary I have is still stuck stating it has no leader and the CA isn't initialized.
I'm registering to the primary with the new certificates just fine, so part of me is wondering if my `additional-dnsnames` aren't good? Per the docs, it looks like it should just be `*.DATACENTER.consul` if I've kept domain defaults, which I have.
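One quick way to check the SAN question is to inspect the certificate directly. As a sketch, the throwaway cert below stands in for a real `consul tls cert create -server` output, which should carry `server.<datacenter>.<domain>` (e.g. `server.us-east-1.consul`) since that is the name agents check when `verify_server_hostname` is enabled; run the same inspection against your actual cert file:

```shell
# Generate a demo cert carrying the SAN a server cert needs, then print
# its SANs. All file names and the datacenter name are hypothetical.
openssl ecparam -genkey -name prime256v1 -noout -out demo-key.pem
openssl req -new -x509 -key demo-key.pem -out demo-server.pem -days 30 \
  -subj "/CN=server.us-east-1.consul" \
  -addext "subjectAltName=DNS:server.us-east-1.consul,DNS:localhost,IP:127.0.0.1"

# Inspect the SANs (do this against your real server cert):
openssl x509 -in demo-server.pem -noout -ext subjectAltName
```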
@reskin89 : The document you followed is about the agent TLS PKI, which is what Consul server and client agents use to authenticate with each other for RPC communication.
That's entirely separate from the service mesh PKI, which allows services in the mesh to authenticate with each other. In changing your service mesh PKI (aka "connect CA") provider from Vault to Consul, there's nothing about the agent TLS PKI that needed to change.
Unfortunately, the term "built-in CA" is overloaded in Consul's documentation to refer to both:
- "connect CA Consul provider" (a service mesh PKI construct)
- a helper command (`consul tls ca/cert`) that can be used to create agent TLS PKI certificates
That overloaded term makes it even harder to tell that the agent TLS PKI and service mesh PKI are separate systems. The docs should be updated to disambiguate.
If you revert your agent TLS config (e.g., `cert_file`, `ca_file`, `key_file`) back to what it was when the secondary DCs were working, that should help.
That would lead me back to my initial issue: my agents were getting `x509: certificate signed by unknown authority` after I updated to the Consul CA. I attempted to distribute to them the root, the intermediate, and a chain with the root and intermediate, to no avail.
After I ran `consul tls ca create` and distributed the CA created there to my agent, it came online no problem (in my primary).
Is that step also unrelated, or is that part generating a CA chain that's signed by the root of the connect CA?
I'll revert my configs and see what happens
After config revert, one secondary:
```
Dec 18 17:55:09 us-east-1-01 consul[8350]: agent: Synced node info
Dec 18 17:55:09 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:12 us-east-1-01 consul[8350]: agent: (LAN) joined: number_of_nodes=3
Dec 18 17:55:12 us-east-1-01 consul[8350]: agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
Dec 18 17:55:16 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:25 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:53 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
```
second secondary:
```
Dec 18 17:58:01 us-west-2-02 consul[8379]: agent: Synced node info
Dec 18 17:58:02 us-west-2-02 consul[8379]: 2023-12-18T17:58:02.394Z [WARN] agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1
Dec 18 17:58:02 us-west-2-02 consul[8379]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1
```
@reskin89: Just to confirm, did you change the service mesh CA config by calling the CLI or API endpoint, or by changing the agent configuration `ca_config` stanza?
I ask because the agent config stanza `ca_config` is only used when "initially bootstrapping the cluster" (according to the docs).
I changed it via the CLI with `consul connect ca set-config`.
I found a stale JSON config that had a `connect` block in it for CA configs, so I removed it. Now at this juncture it's throwing the x509 errors again, and when I try to log in to the UI of the secondaries I get this:
```
Dec 18 18:27:02 us-east-1-01 consul[14986]: 2023-12-18T18:27:02.772Z [WARN] agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 18:27:02 us-east-1-01 consul[14986]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
```
I should clarify: I get that x509 error logging in to one of my secondaries; the other I can run consul commands against. If I try to switch datacenters with the dropdown in the UI, it tells me none of those servers are reachable 🤷