Support Listener Access Logging for TLS troubleshooting
Description: Envoy supports several types of Access Loggers. The access log that is most commonly in use is configured in HCM or TCP/UDP/Thrift Proxy. Envoy additionally supports a listener-level access log. The listener access log is particularly useful for:
- Providing information related to downstream connection establishment issues, e.g. using the %DOWNSTREAM_TRANSPORT_FAILURE_REASON% operator, and following the TLS troubleshooting guide.
- Providing information when other network-filter level access loggers are not invoked (e.g. when there is no matched filter chain, NR response flag).
Currently, Envoy Gateway uses the user-defined access log format for the listener access log, but the listener access log is only emitted when there is no matching filter chain.
https://github.com/envoyproxy/gateway/blob/0abdda7a033f0e37647177a588a904c90b783149/internal/xds/translator/accesslog.go#L216
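For context, the current behavior roughly corresponds to a listener access log gated on the NR response flag. The fragment below is a minimal sketch in Envoy's own configuration terms, assuming the translator attaches a response flag filter to the listener access logger. It is not copied from the linked source.

```yaml
# Sketch only: a listener access log filter that passes entries
# carrying the NR response flag (no matching filter chain / no route).
filter:
  response_flag_filter:
    flags:
    - NR
```

With a filter like this, a failed TLS handshake on a connection that did match a filter chain would produce no listener log entry, which is the gap this issue describes.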
Envoy Gateway should also support listener access logs for TLS troubleshooting. Common issues include:
- Downstream TLS parameter mismatch: supported versions, ciphers, ...
- Downstream Client Certificate validation issues: expiration, trust store compatibility, ...
To support the TLS troubleshooting use case, Envoy Gateway may:
- Make the listener access log fully configurable for end-users (including format, filters, etc.)
- Conditionally emit a listener access log containing the %DOWNSTREAM_TRANSPORT_FAILURE_REASON% operator when a TLS connection error occurs. This can be done by appending the TLS details to the user-defined/default access log format, or by emitting a different log that is fully controlled by EG and contains relevant information such as Client IP, Available TLS Context (SNI, Cert details, ...)
To filter listener logs based on a TLS failure, a CEL access log filter can be used:
filter:
  extension_filter:
    name: envoy.access_loggers.extension_filters.cel
    typed_config:
      '@type': type.googleapis.com/envoy.extensions.access_loggers.filters.cel.v3.ExpressionFilter
      expression: connection.transport_failure_reason != "-"
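To show where that filter would sit, here is a sketch of a complete listener-level access log entry using the file access logger. The output path and the format string are assumptions chosen for illustration. Only the CEL filter comes from the snippet above.

```yaml
# Sketch only: listener-level access log that emits an entry when the
# downstream transport (e.g. TLS handshake) reports a failure reason.
access_log:
- name: envoy.access_loggers.file
  filter:
    extension_filter:
      name: envoy.access_loggers.extension_filters.cel
      typed_config:
        '@type': type.googleapis.com/envoy.extensions.access_loggers.filters.cel.v3.ExpressionFilter
        expression: connection.transport_failure_reason != "-"
  typed_config:
    '@type': type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /dev/stdout  # assumed sink
    log_format:
      text_format_source:
        # Assumed format: timestamp, client address, SNI, TLS version, failure reason.
        inline_string: "[%START_TIME%] %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %DOWNSTREAM_TLS_VERSION% %DOWNSTREAM_TRANSPORT_FAILURE_REASON%\n"
```

Listener-level logs have no HTTP request context, so the format is limited to connection-level operators such as the ones used here.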
Notes from today's community meeting:
- Prefer option (2)
- Emit the user-defined access log
- Users that are interested in the transport failure reason should add the relevant operator to their format
Emit the user-defined access log
@guydc @arkodg does this mean EG will have a separate user-defined access log format for the listener-level log? From the Envoy docs, listener-specific fields seem very limited.
TBH, is this a common use case? Can it be covered by using EnvoyPatchPolicy (or ExtensionManager) instead of introducing a new API?
@zhaohuabing my vote would be to use the same format that's defined in envoyProxy.spec.telemetry for all access log cases, and use the default value even for the listener access log case.
is this a common use case?
I would say that observability for TLS failures, especially when client certificate auth is a feature, is important. Having said that, most proxies support this sort of troubleshooting through error logs, rather than access log.
can it be covered by using EnvoyPatchPolicy (or ExtensionManager) instead of introducing a new API?
Yes. But, the current proposal is not to extend the API. Rather, we will emit listener access logs in more scenarios and not just NR.
In light of #3688, another approach here would be to make Listener log matcher configurable. This will:
- not change current default filters (only log for NR) or format
- allow users to emit listener access log for TLS troubleshooting or other reasons (e.g. observability for TCP connections in general)
@arkodg , @zirain - WDYT?
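As a rough illustration of the configurable-matcher idea, an EnvoyProxy resource might look like the sketch below. It assumes the `matches` list of CEL expressions from #3688 can be applied to listener logs. The `type: Listener` selector, the sink path, and the format text are hypothetical and shown only to make the proposal concrete.

```yaml
# Sketch only: field names other than `matches` are assumptions.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: tls-troubleshooting   # hypothetical name
spec:
  telemetry:
    accessLog:
      settings:
      - type: Listener        # hypothetical: apply this setting to the listener log
        matches:
        - "connection.transport_failure_reason != '-'"
        format:
          type: Text
          text: |
            [%START_TIME%] %DOWNSTREAM_REMOTE_ADDRESS% %REQUESTED_SERVER_NAME% %DOWNSTREAM_TRANSPORT_FAILURE_REASON%
        sinks:
        - type: File
          file:
            path: /dev/stdout
```

Users who set no matcher would keep the current default (log only for NR), so existing behavior would be unchanged.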
sounds good
In light of https://github.com/envoyproxy/gateway/pull/3688, another approach here would be to make Listener log matcher configurable. This will:
- not change current default filters (only log for NR) or format
- allow users to emit listener access log for TLS troubleshooting or other reasons (e.g. observability for TCP connections in general)
Should connection.transport_failure_reason != "-" and other listener-level errors (if any) be included in the default filters?