No longer possible to use label_selector in auto_join field
Describe the bug
After https://github.com/hashicorp/vault/pull/29228, using a k8s label selector fails with a parsing error because of the validation now performed on the auto_join field.
To Reproduce
Steps to reproduce the behavior:
- Set the auto_join in retry_join to
  "provider=k8s namespace=vault label_selector=\"app.kubernetes.io/name=vault,component=server\""
- Try to start vault
- See error:
  vault error loading configuration from /tmp/storageconfig.hcl: error parsing 'storage': malformed auto_join pair label_selector="app.kubernetes.io/name=vault,, expected key=value
Expected behavior
Expectation is that the server starts up correctly and uses the label_selector as specified.
Environment:
- Vault Server Version (retrieve with vault status): 1.19.0
- Vault CLI Version (retrieve with vault version): N/A
- Server Operating System/Architecture: Amazon linux
Vault server configuration file(s):
storage "raft" {
  path = "/vault/data"
  autopilot_reconcile_interval = "10s" # default is "10s"
  autopilot_update_interval = "2s" # default is "2s"
  # Ref: https://developer.hashicorp.com/vault/tutorials/operations/performance-tuning#performance_multiplier
  performance_multiplier = 1
  retry_join {
    auto_join = "provider=k8s namespace=vault label_selector=\"app.kubernetes.io/name=vault, component=server\""
    auto_join_scheme = "https"
    leader_tls_servername = "vault"
    # Still need these specified when using a self-signed cert
    leader_ca_cert_file = "/vault/userconfig/tls-ca/ca.crt"
    leader_client_cert_file = "/vault/userconfig/tls-server/tls.crt"
    leader_client_key_file = "/vault/userconfig/tls-server/tls.key"
  }
}
Thank you so much for the report, @tedo-benchling! We especially appreciated the extra effort you spent on identifying the PR that caused this. Look for this to be fixed in the next release. :)
@heatherezell hi, I still encountered this error in 1.19.1 during a fresh installation, using the same configuration that was working fine in 1.18.5:
failed to parse addresses from auto-join metadata: discover: label_selector: - equals in key's value, enclosing double-quote needed label_selector="value-with-=-symbol""
retry_join {
  auto_join = "provider=k8s label_selector=\"app.kubernetes.io/name=vault,component=server\" namespace=\"vault-cluster\""
  auto_join_scheme = "http"
}
Edit: After adding a space between the comma and component in label_selector, the error is gone. Strangely, the same configuration works fine in 1.18.5 and earlier versions without the space.
I seem to be having this issue running 1.19.1 as well. I can't find the fix in the release notes for 1.19.1, but it is in the changelog. I was trying to find out whether something else may have changed, but I can't confirm that it was fixed in a subsequent release.
It looks like this is actually a bug in go-discover where the re-encoded representation of some parsed config (which we now use) is not correct.
Given the original configuration of:
"provider=k8s namespace=vault label_selector=\"app.kubernetes.io/name=vault,component=server\""
We'll parse and normalize and then re-encode with config.String() to get:
provider=k8s label_selector=app.kubernetes.io/name=vault,component=server namespace=vault
when it ought to be:
provider=k8s label_selector=\"app.kubernetes.io/name=vault,component=server\" namespace=vault
Until we get it fixed upstream I'd recommend adding a space between values in the label selector.
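To make the re-encoding problem concrete, here is a rough Python sketch of the round trip. It is not go-discover's code. The names parse and encode are hypothetical, and the quoting rule (quote only when the value contains a space, a double quote, or a backslash) is my assumption about the encoder's behaviour, chosen because it reproduces the output shown above and would explain why the space workaround helps.

```python
import shlex


def parse(s):
    """Split 'key=value key="quoted value"' into a dict (shell-style quoting)."""
    pairs = {}
    for token in shlex.split(s):
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"malformed pair {token!r}, expected key=value")
        pairs[key] = value
    return pairs


def encode(pairs, triggers=' "\\'):
    """Re-encode a dict as 'provider=... key=value ...' with sorted keys.

    ASSUMPTION: a value is quoted only if it contains one of `triggers`.
    """
    def quote(text):
        if any(ch in triggers for ch in text):
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            return f'"{escaped}"'
        return text

    # 'provider' goes first, remaining keys in sorted order; empty values are skipped.
    keys = ["provider"] + sorted(k for k in pairs if k != "provider")
    return " ".join(f"{quote(k)}={quote(pairs[k])}" for k in keys if pairs.get(k))


original = 'provider=k8s namespace=vault label_selector="app.kubernetes.io/name=vault,component=server"'
spaced = 'provider=k8s namespace=vault label_selector="app.kubernetes.io/name=vault, component=server"'

print(encode(parse(original)))  # quotes around label_selector are lost
print(encode(parse(spaced)))    # the space triggers quoting, so the quotes survive
# Hypothetical alternative rule: also quote values that contain '='.
print(encode(parse(original), triggers=' "\\='))
```

Under that assumption, the first call drops the quotes around the selector, so a second parse sees a bare = inside the value and rejects it. With the space in place the value is quoted again and the round trip is stable. The last call only illustrates one way a stricter quoting rule could behave; it is not a description of the upstream fix.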
Not the answer you're likely seeking, but a workaround (and my preferred method) of joining in a kube cluster doesn't directly talk to the Kube API at all. We rely on DNS records for headless Services.
Given your settings in the example above, the following would be the equivalent while relying on Kubernetes' built-in DNS feature and the Vault Helm chart's headless service, named vault-internal, created by default.
storage "raft" {
  # ... snip ...
  retry_join {
    leader_api_addr = "http://vault-internal:8200"
  }
}
This resolves all IPs (IPv4 or IPv6) of the pods in the service named vault-internal. That's what the auto_join block was looking to do anyway, now with fewer moving parts. (docs)
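If it helps to picture what that lookup returns, the following Python sketch lists one API address per record behind a headless Service name. It is only an illustration, not how Vault resolves leader_api_addr internally. The function name peer_api_addrs and the injectable resolver argument are hypothetical, and the default port of 8200 is taken from the example above.

```python
import socket


def peer_api_addrs(service, port=8200, scheme="http", resolver=socket.getaddrinfo):
    """Return one URL per distinct IP that `service` resolves to.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    function can be exercised without a real cluster DNS.
    """
    infos = resolver(service, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    ips = sorted({info[4][0] for info in infos})
    # IPv6 literals need brackets when combined with a port.
    return [
        f"{scheme}://[{ip}]:{port}" if ":" in ip else f"{scheme}://{ip}:{port}"
        for ip in ips
    ]
```

Run from a pod in the same namespace, peer_api_addrs("vault-internal") would be expected to return one entry per Vault pod behind the headless Service, assuming the chart's default service name is in use.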
1.19.5 fixed the issue ✅
This is fixed for me as well. Going to close.