security/ldap: clarify that cert verification error is due to missing LDAP CA cert
❯ cockroach sql --url "postgresql://testuser@localhost:26257"
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: failed to connect to `host=localhost user=testuser database=`: failed to write startup message (tls: failed to verify certificate: x509: certificate signed by unknown authority)
Failed running "sql"
If I don’t setup my CA cert for LDAP using SET cluster setting server.ldap_authentication.domain.custom_ca="", I get this vague error. The error is expected but we need to be explicit that this due to the missing LDAP CA certificate and bubble that up. Otherwise it’s confusing which cert it is referring to.
The error message ideally should be something along these lines:
ERROR: failed to authenticate user via LDAP (tls: failed to verify LDAP server's certificate: x509: certificate signed by unknown authority)
Failed running "sql"
Jira issue: CRDB-41388
Epic CRDB-33829
Currently the sql client fails with error:
ERROR: LDAP authentication: unable to establish LDAP connection
SQLSTATE: 28000
Failed running "sql"
The log obtained in this case is:
I241014 07:41:07.215924 484544 4@util/log/event_log.go:39 ⋮ [T1,Vsystem,n1,client=127.0.0.1:59669,hostssl,user=‹souravcrl›] 13 ={"Timestamp":1728891667215917000,"EventType":"client_authentication_failed","InstanceID":1,"Network":"tcp","RemoteAddress":"‹127.0.0.1:59669›","SessionID":"17fe41cf77d101980000000000000001","Transport":"hostssl","SystemIdentity":"‹souravcrl›","Reason":"USER_RETRIEVAL_ERROR","Detail":"LDAP authentication: unable to establish LDAP connection\nerror when trying to create LDAP connection: LDAPs connection failed: ‹LDAP Result Code 200 "Network Error"›: ‹tls: failed to verify certificate›: ‹x509: “crlcloud.dev” certificate is not trusted›","Method":"ldap"}
@rafiss we had the discussion around errors from LDAP server here: https://reviewable.io/reviews/cockroachdb/cockroach/126227#-O1I2xxU3oIWRvw10dLD
errors.WithMessagef will put the detailed error in prefix whereas detailed error is being obtained from a more downstream library module. I had to return from ValidateLDAPLogin this way since I can't send the detailed error to sql client but needed to log it here for debugging purposes. I don't really want to redact it also since redaction can't be done in a granular fashion as error is from an internal library module and I want to log it entirely as it was not formatted by go-ldap library to redact PII.
Should we expect a client error here as other connect attempts may result in different errors containing sensitive information like server url, port, no-TLS enabled on the server or should we leave this as is?
That may be a question for @dikshant -- is it OK to not show the full explanation back to the user as long as the full explanation is included in the logs?
I believe the reasoning that @souravcrl describes is that if we show the full explanation for this error, it will be hard to differentiate that from other possible errors, which may expose unwanted information back to an unauthenticated client.
If we absolutely need to show the full error in this case, then perhaps we could just do rudimentary string comparison to decide if we can show the full error. This is not guaranteed to keep working if the error message ever changes when we upgrade this library in the future.
Had this discussion with Biplav, since the client error is not guaranteed to be regex matched and we don't want to expose exact error details, will be just proceeding with having the error in the logs.