consul Reconfiguring Connect CA to Built-in from Vault doesn't write Certs

Overview of the Issue

I'm currently on Consul 1.13.6 and am using the connect/ca/configuration endpoint to switch from the Vault CA provider to the Consul built-in provider.

My CA config does not include a root certificate or private key, and I have removed the settings for the paths to them from my servers' configuration.

When I do this and restart Consul, I can hit the connect/ca/roots endpoint and see the new root certificate. However, it doesn't look like it's written anywhere, nor is a private key. I am also unable to specify cert_file and key_file in the hope that, if the files are not there, Consul writes them.

This makes it difficult to be idempotent: I cannot use something like Vault to store the generated values, and Consul fails to start with "no cert_file is specified" or "key_file does not match cert_file".

Reproduction Steps

1. Configure Consul with a valid Vault CA provider configuration
2. Use the connect/ca/configuration endpoint with a basic configuration of:

{
    "Provider": "consul",
    "ForceWithoutCrossSigning": true
}

3. Remove the cert_file and ca_file keys
4. Restart Consul
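
For readers reproducing this, the call in the steps above can be sketched as follows. This is a minimal sketch, not the reporter's actual tooling: the agent address `http://127.0.0.1:8500` and the helper name `build_ca_config_request` are assumptions, and the request is only built, not sent.

```python
import json
import urllib.request

# Assumed local agent address; adjust for your environment.
CONSUL_ADDR = "http://127.0.0.1:8500"


def build_ca_config_request(payload, token=None, addr=CONSUL_ADDR):
    """Build (but do not send) a PUT to /v1/connect/ca/configuration.

    Hypothetical helper for illustration only.
    """
    headers = {"Content-Type": "application/json"}
    if token:
        # Consul accepts ACL tokens in the X-Consul-Token header.
        headers["X-Consul-Token"] = token
    return urllib.request.Request(
        url=f"{addr}/v1/connect/ca/configuration",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="PUT",
    )


# The payload from the reproduction steps above.
request = build_ca_config_request(
    {"Provider": "consul", "ForceWithoutCrossSigning": True}
)
# urllib.request.urlopen(request) would send it to a live agent.
```

Building the request separately from sending it makes it easy to inspect the exact body before it reaches a cluster.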

Consul info for both Client and Server

Server info

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 61547a41
	version = 1.13.6
	version_metadata =
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 2
	leader = false
	leader_addr =
	server = true
raft:
	applied_index = 12323864
	commit_index = 0
	fsm_pending = 0
	last_contact = never
	last_log_index = 12333022
	last_log_term = 205821
	last_snapshot_index = 12323864
	last_snapshot_term = 205817
	latest_configuration = [{REDACTED}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 205821
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 101
	max_procs = 2
	os = linux
	version = go1.18.9
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 259
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 130662
	members = 4
	query_queue = 0
	query_time = 50
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 38107
	members = 6
	query_queue = 0
	query_time = 1

Operating system and Environment details

EC2 Amazon Linux 2 AWS

Apr 11 '23 00:04 reskin89

Hi @reskin89,

Thank you for reaching out. I hope to better understand what you're hoping to accomplish and anything we might be able to make clearer for someone in the same position in the future (e.g., clarifying docs). With that context, let's dig in!

Background

Consul has two different PKIs:

Connect PKI: Used for services in the mesh (and also client agents if using auto-config/encrypt)
Agent TLS PKI: Used for server agents (and client agents if not using auto-config/encrypt)

The cert_file and key_file are the public certificate and corresponding private key for the agent TLS certificate, which the server agent uses to authenticate as a server agent when communicating with other agents (servers and clients). ca_file is the public certificate for the CA from which cert_file and key_file were issued. Assuming you have TLS enabled for agent communication (e.g., verify_outgoing, verify_incoming, and verify_server_hostname set to true), I wouldn't expect a server agent without those files to be able to communicate with other agents. I haven't traced the codepaths to check whether the particular error messages you observed would be triggered, but it seems reasonable that they might be.

I wouldn't expect this to have anything to do with the completely separate Connect PKI. In other words, I would guess that you would get the same result by skipping reproduction steps 1-2.
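
To make the separation concrete, here is a small sketch that sorts the settings named in this thread by the PKI they belong to. It assumes the flat, top-level agent TLS keys used in this thread (`cert_file`, `key_file`, `ca_file`, and the `verify_*` options) and the `connect` block's `ca_provider` and `ca_config` keys. The helper name and the example file paths are hypothetical.

```python
# Top-level agent TLS settings named in this thread (agent TLS PKI).
AGENT_TLS_KEYS = {
    "ca_file", "cert_file", "key_file",
    "verify_incoming", "verify_outgoing", "verify_server_hostname",
}

# Keys inside the "connect" block that configure the Connect CA (Connect PKI).
CONNECT_CA_KEYS = {"ca_provider", "ca_config"}


def split_by_pki(agent_config):
    """Return (agent_tls_settings, connect_ca_settings) from an agent config dict."""
    agent_tls = {k: v for k, v in agent_config.items() if k in AGENT_TLS_KEYS}
    connect = agent_config.get("connect", {})
    connect_ca = {k: v for k, v in connect.items() if k in CONNECT_CA_KEYS}
    return agent_tls, connect_ca


# Hypothetical server config; the paths are placeholders.
example = {
    "server": True,
    "ca_file": "/etc/consul.d/agent-ca.pem",
    "cert_file": "/etc/consul.d/server.pem",
    "key_file": "/etc/consul.d/server-key.pem",
    "verify_incoming": True,
    "connect": {"enabled": True, "ca_provider": "consul"},
}
agent_tls, connect_ca = split_by_pki(example)
```

Changing the Connect CA provider touches only the second group; the first group should stay as it was.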

Clarifications

Out of curiosity, what originally led you to consider switching from the Vault provider to the Consul provider for the Connect CA? And what led you down the path of trying to change ca_file, cert_file, and key_file as a part of this?

Maybe there's a better way to accomplish what you're trying to accomplish, or a feature request that comes out of this. There may also be documentation improvements to clarify things here.

Apr 17 '23 23:04 jkirschner-hashicorp

@jkirschner-hashicorp Ryan and I work together. The reason for the switch to Consul CA Provider is due to the following issue: https://github.com/hashicorp/consul/issues/11685

Apr 18 '23 13:04 GordonMcKinney

Ah, yes - I thought the GitHub handle looked familiar!

With that context, it sounds like ca_file, cert_file, and key_file are a red herring (because they are unrelated to the Connect CA). Perhaps they were removed in a desire to ensure the Consul CA provider was generating its own public/private key pair rather than using an existing one? The ca_file, cert_file, and key_file should be left as-is.

If I understand correctly, you called the Update CA Configuration API endpoint without specifying Config.PrivateKey and Config.RootCert in the request payload, so the provider will generate its own public/private key pair. But there's no endpoint (as far as I know) to ask Consul what that private key is, so you have no ability to store the generated key pair elsewhere (e.g., Vault).

What are your thoughts on generating the PrivateKey and RootCert first, saving them in Vault, and then submitting them in the Update CA Configuration call?
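
A minimal sketch of that suggestion, assuming the root certificate and private key were already generated elsewhere and are available as PEM text. The function name and the basic sanity checks are illustrative assumptions; `PrivateKey` and `RootCert` are the `Config` fields named above.

```python
def build_consul_provider_config(root_cert_pem, private_key_pem):
    """Build an Update CA Configuration payload that supplies existing key material.

    Hypothetical helper; the checks only catch obviously swapped or empty inputs.
    """
    if "BEGIN CERTIFICATE" not in root_cert_pem:
        raise ValueError("root_cert_pem does not look like a PEM certificate")
    if "PRIVATE KEY" not in private_key_pem:
        raise ValueError("private_key_pem does not look like a PEM private key")
    return {
        "Provider": "consul",
        "Config": {
            "PrivateKey": private_key_pem,
            "RootCert": root_cert_pem,
        },
    }
```

Because the key pair exists before the API call, it can be stored (in Vault or on disk) and re-submitted after a failure, which is what makes the operation repeatable.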

Apr 18 '23 15:04 jkirschner-hashicorp

> If I understand correctly, you called the Update CA Configuration API endpoint without specifying Config.PrivateKey and Config.RootCert in the request payload, so the provider will generate its own public/private key pair. But there's no endpoint (as far as I know) to ask Consul what that private key is, so you have no ability to store the generated key pair elsewhere (e.g., Vault).

That's correct.

> What are your thoughts on generating the PrivateKey and RootCert first, saving them in Vault, and then submitting them in the Update CA Configuration call?

That sounds good but in this instance we are not using Vault. Would it be possible to write the files locally instead?

Apr 18 '23 15:04 GordonMcKinney

That was my next step, to generate my own. It makes it easier to securely store these in the event there's some type of critical failure, that way once fixed we don't have to distribute a new public key to all of our clients.

Thank you for the clarification as well! Understanding the other config parameters better is an enormous help.

Apr 18 '23 15:04 reskin89

@jkirschner-hashicorp I tried generating my own certificates. The primary datacenter in my cluster takes the root just fine; however, it seems the secondaries verify the certificate I provide with the crypto/x509 library, which seems to be throwing an error because it's self-signed.

It's attempting to extract an intermediate chain but there isn't one to extract; we're going pure root certificate here.

The error I receive when updating the secondary data center:

rpc error making call: Error updating secondary datacenter CA config: Failed to set the intermediate certificate with the CA provider: could not verify intermediate cert against root: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "serial:<SERIALNUMBER>")

After digging through the Consul code, I traced these errors to crypto/x509. I'm unsure what it's attempting to verify the certificate against, if not the primary datacenter's root?

Apr 19 '23 18:04 reskin89

It's almost as if the secondary datacenters are expecting an intermediate in that 'rootCert' configuration param, and think I'm providing one that's self-signed.

Apr 19 '23 18:04 reskin89

@reskin89 : Does the same problem occur in secondary if you never specify PrivateKey and RootCert in any datacenter (which causes Consul to generate its own)?

I assume you're specifying the self-signed cert details (PrivateKey and RootCert) in the primary datacenter's connect CA configuration. Are you setting PrivateKey and RootCert to anything in the secondary datacenters? (It's not actually clear to me from the Consul CA provider docs whether or not to set those in secondary DCs. We should clarify in the docs.)

Answers to one or both of these questions might help us narrow down what's happening faster.

Apr 21 '23 13:04 jkirschner-hashicorp

I am actually setting the private key and root cert in the secondaries. Thinking about it, since they delegate the root responsibility to the primary, should I submit the change to those with no config and let it self-generate? It would then delegate to the secondary anyway even if it generates its own root, correct?

Apr 21 '23 13:04 reskin89

My guess is that you don't need to specify anything for PrivateKey and RootCert in the secondary datacenters, and that the secondary datacenters will generate a CSR that they ask the primary datacenter to sign (which delegates to the CA provider of the primary DC). I'm not sure though. You can likely test it faster than I can look into it.

Apr 21 '23 13:04 jkirschner-hashicorp

That actually seems like the most logical route. I'll test shortly and report my findings.

Thanks for all the help!

Apr 21 '23 13:04 reskin89

Interestingly, that results in the same error

Apr 21 '23 14:04 reskin89

More info: I used my old Vault config to reset my development cluster, and the secondaries are still producing that error as well.

Apr 21 '23 14:04 reskin89

I will correct myself here.

I was able to revert to my old Vault config in the secondaries after hard reboots of the servers. Something must have been stuck in memory (even after Consul service restarts).

Apr 21 '23 15:04 reskin89

@reskin89 : It seems like you've reverted back to the old state (with the Vault CA provider in the primary and secondaries). What happens if you try moving to the Consul CA provider at this point, setting PrivateKey and RootCert only in the primary DC? (Or what have you tried since?)

Also, this is tangential, but why set ForceWithoutCrossSigning to true? That will cause temporary connection failures until service mesh proxies (and client agents if using auto-config/encrypt) get a new cert.

Apr 25 '23 15:04 jkirschner-hashicorp

I've done this previously and it results in the same errors in the secondaries.

If cross signing isn't set during this operation the API rejects it

Apr 25 '23 16:04 reskin89

> If cross signing isn't set during this operation the API rejects it

Do you get any relevant log messages when the rejection occurs?

Both the Vault and Consul CA providers support cross-signing, so I'm not sure why ForceWithoutCrossSigning would be needed here.

Apr 25 '23 16:04 jkirschner-hashicorp

Really sorry for the delay here @jkirschner-hashicorp , been in the weeds with the consul lambda stuff 😄 .

I will test this when I'm able to circle back to it; I need my environment stable at the moment.

May 11 '23 18:05 reskin89

@reskin89: When you do circle back to this, I thought of a reason why the API might reject the cross-sign request. The Vault ACL token that Consul's Vault CA provider uses to ask the PKI Secrets Engine to perform the cross-sign operation needs elevated privileges: https://developer.hashicorp.com/consul/docs/connect/ca/vault#additional-vault-acl-policies-for-sensitive-operations

It's worth checking whether the Vault ACL token configured in the provider has the privileges needed for that cross-sign call to work.

May 11 '23 18:05 jkirschner-hashicorp

That makes a whole lot of sense when you say that, I'll definitely give that a look. Thank you!!

May 11 '23 18:05 reskin89

@jkirschner-hashicorp I've finally circled back to this.

Your recommendation on vault privileges worked, but now I'm in another spot where my secondary datacenters are failing.

I followed this doc: https://developer.hashicorp.com/consul/tutorials/security/tls-encryption-secure

to generate a CA, cert, and key file from my primary datacenter.

I applied them to my config.json for my server/agent config and my primary datacenter is up and running without error.

However, I tried to distribute the CA, cert, and keys to my secondaries and they're in a failing state.

One of my datacenters is just repeatedly stating that the CA and ACLs are still initializing, and it's also repeatedly stating "no cluster leader":

➜  ~ consul monitor -log-level debug
2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:12.341Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:12.341Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:12.754Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:62982 latency="181.953µs"
2023-12-18T16:57:12.758Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=127.0.0.1:62990 latency="63.234µs"
2023-12-18T16:57:13.341Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:13.341Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:14.342Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:14.342Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:15.342Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:15.343Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:16.343Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:16.343Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:17.071Z [DEBUG] agent.server.memberlist.wan: memberlist: Initiating push/pull sync with: REDACTED:8302
2023-12-18T16:57:17.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:17.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:18.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:18.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:19.344Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:19.344Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:20.345Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:20.345Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:21.345Z [DEBUG] agent.server.cert-manager: CA has not finished initializing
2023-12-18T16:57:21.345Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
2023-12-18T16:57:22.057Z [ERROR] agent.http: Request error: method=GET url=/v1/health/state/any from=127.0.0.1:62996 error="No cluster leader"
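
Since the output above is mostly the same two messages repeating once per second, a small tally can make the distinct messages easier to spot. This is a sketch that assumes the `timestamp [LEVEL] component: message` layout shown in the log above; the regular expression and function name are illustrative, not part of Consul.

```python
import re
from collections import Counter

# Matches lines such as:
# 2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing
LINE_RE = re.compile(
    r"^\S+ \[(?P<level>[A-Z]+)\]\s+(?P<component>[\w.\-]+): (?P<message>.*)$"
)


def tally_messages(lines):
    """Count occurrences of each (level, component, message) triple."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["component"], match["message"])] += 1
    return counts


sample = [
    "2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: ACLs have not finished initializing",
    "2023-12-18T16:57:11.340Z [DEBUG] agent.server.cert-manager: CA has not finished initializing",
    "2023-12-18T16:57:12.341Z [DEBUG] agent.server.cert-manager: CA has not finished initializing",
]
for (level, component, message), count in tally_messages(sample).most_common():
    print(f"{count:>4}  [{level}] {component}: {message}")
```

Lines that embed per-request values (such as the `agent.http` latency) will each count as distinct messages in this simple version.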

My other secondary is repeatedly stating it has a bad tls certificate (I ran consul tls cert create -server for each datacenter and distributed the certificates accordingly) along with "no cluster leader" and seemingly no leader elections.

my new CA config:

{
	"Provider": "consul",
	"Config": {
		"IntermediateCertTTL": "8760h",
		"LeafCertTTL": "72h",
		"PrivateKeyBits": 256,
		"RootCertTTL": "87600h",
		"RotationPeriod": "2160h"
	},
	"State": null,
	"ForceWithoutCrossSigning": false,
	"CreateIndex": 8,
	"ModifyIndex": 15770485
}

I've also performed a consul join -wan between all of my datacenters as well to no avail.

Dec 18 '23 17:12 reskin89

@reskin89 : Were the secondary datacenters (with WAN federation) previously coming up fine? The agent TLS PKI is separate from the service mesh PKI, so I wouldn't expect a change to your service mesh PKI provider to affect anything about agent TLS or how you'd set it up for WAN fed.

(As an aside: I think our documentation could do a better job conveying these topics.)

Dec 18 '23 17:12 jkirschner-hashicorp

Yep, all datacenters were known working in a fine state prior to this update. I updated the CA config after generating the certs in the document mentioned, then I distributed them to all datacenters and restarted consul.

Dec 18 '23 17:12 reskin89

The primary is just fine, which is what's interesting. What I'm now witnessing is that one of the secondaries finally elected a leader, but it's unable to sign leaf certificates, stating it has no root. Yet if I hit /v1/connect/ca/roots I clearly see the previous root from Vault (since some nodes still have leaf certificates from it) and the Consul CA primary cert.

2023-12-18T17:38:16.115Z [DEBUG] agent: Node info in sync
2023-12-18T17:38:16.124Z [WARN]  agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1
2023-12-18T17:38:17.252Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/leader from=127.0.0.1:49512 latency="106.265µs"
2023-12-18T17:38:17.255Z [DEBUG] agent.http: Request finished: method=GET url=/v1/status/peers from=127.0.0.1:49520 latency="92.674µs"
2023-12-18T17:38:17.899Z [ERROR] agent.server.rpc: failed to read byte: conn=from=10.142.79.189:41514 error="remote error: tls: bad certificate"

The other secondary I have is still stuck stating it has no leader and CA isn't initialized.

I'm registering to the primary with the new certificates just fine, so part of me is wondering if my additional-dnsnames aren't good? Per the docs, it looks like it should just be *.DATACENTER.consul if I've kept the domain defaults, which I have.
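
One way to sanity-check the names: server certificates created with `consul tls cert create -server` carry a `server.<datacenter>.<domain>` DNS name, and that is the name checked when verify_server_hostname is enabled. The sketch below compares a list of DNS SANs against that expected name. The helper names and the datacenter names in the usage are assumptions, and the wildcard handling is a simplification (one leading `*.` label only).

```python
def expected_server_name(datacenter, domain="consul"):
    """Name a server certificate is expected to carry for this datacenter."""
    return f"server.{datacenter}.{domain}".lower()


def covers_server_name(dns_sans, datacenter, domain="consul"):
    """True if any SAN equals the expected name or is a one-label wildcard for it."""
    expected = expected_server_name(datacenter, domain)
    parent = expected.split(".", 1)[1]  # e.g. "us-east-1.consul"
    for san in dns_sans:
        san = san.lower()
        if san == expected:
            return True
        if san.startswith("*.") and san[2:] == parent:
            return True
    return False
```

For example, `covers_server_name(["server.us-west-2.consul"], "us-east-1")` is false, which is the kind of mismatch to look for when certificates for several datacenters are distributed together.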

Dec 18 '23 17:12 reskin89

@reskin89 : The document you followed is about the agent TLS PKI, which is what Consul server and client agents use to authenticate with each other for RPC communication.

That's entirely separate from the service mesh PKI, which allows services in the mesh to authenticate with each other. In changing your service mesh PKI (aka "connect CA") provider from Vault to Consul, there's nothing about the agent TLS PKI that needed to change.

Unfortunately, the term "built-in CA" is overloaded in Consul's documentation to refer to both:

"connect CA Consul provider" (a service mesh PKI construct)
a helper command (consul tls ca/cert) that can be used to create agent TLS PKI certificates

That overloaded term makes it even harder to tell that the agent TLS PKI and service mesh PKI are separate systems. The docs should be updated to disambiguate.

If you revert your agent TLS config (e.g., cert_file, ca_file, key_file) back to what it was when the secondary DCs were working, that should help.

Dec 18 '23 17:12 jkirschner-hashicorp

That would lead me back to my initial issue: my agents were getting "x509 certificate signed by unknown authority" after I updated to the Consul CA. I attempted to distribute the root, the intermediate, and a chain with the root and intermediate to them, to no avail.

After I ran consul tls ca create and distributed the ca created there to my agent, it came online no problem (in my primary).

Is that step also unrelated, or does it generate the CA chain that's signed by the root of the connect CA?

I'll revert my configs and see what happens

Dec 18 '23 17:12 reskin89

After config revert, one secondary:

Dec 18 17:55:09 us-east-1-01 consul[8350]: agent: Synced node info
Dec 18 17:55:09 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:12 us-east-1-01 consul[8350]: agent: (LAN) joined: number_of_nodes=3
Dec 18 17:55:12 us-east-1-01 consul[8350]: agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
Dec 18 17:55:16 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:25 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 17:55:53 us-east-1-01 consul[8350]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1

second secondary:

Dec 18 17:58:01 us-west-2-02 consul[8379]: agent: Synced node info
Dec 18 17:58:02 us-west-2-02 consul[8379]: 2023-12-18T17:58:02.394Z [WARN]  agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1
Dec 18 17:58:02 us-west-2-02 consul[8379]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA is uninitialized and unable to sign certificates yet: no root certificate" index=1

Dec 18 '23 17:12 reskin89

@reskin89 : Just to confirm, did you change the service mesh CA config by calling the CLI or API endpoint? Or by changing the agent configuration ca_config stanza?

I ask because the agent config stanza ca_config is only used when "initially bootstrapping the cluster" (according to the docs).

Dec 18 '23 18:12 jkirschner-hashicorp

I changed it via the cli with consul connect ca set-config

Dec 18 '23 18:12 reskin89

I found a stale JSON config that had a connect block in it for CA configs, so I removed it. At this juncture it's throwing the x509 errors again, and when I try to log in to the UI of the secondaries I get this:

Dec 18 18:27:02 us-east-1-01 consul[14986]: 2023-12-18T18:27:02.772Z [WARN]  agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1
Dec 18 18:27:02 us-east-1-01 consul[14986]: agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: CA has not finished initializing" index=1

I should clarify: I get that x509 error when logging in to one of my secondaries. The other I can run consul commands against, but if I try to switch datacenters with the drop-down in the UI, it tells me none of those servers are reachable 🤷

Dec 18 '23 18:12 reskin89