envoy
envoy copied to clipboard
access logging: gRPC logger improvements for Envoy as TLS transparent proxy
Signed-off-by: Niek Temme [email protected]
Commit Message: access logging: gRPC logger add sni hostname and add fields for Envoy as TLS transparent proxy
Additional Description: Hi, I would like to suggest five changes/additions to gRPC access loger and accesslog proto file. These changes mainly benefit using Envoy as a TLS transparent proxy. I am not an experienced c++ developer and any feedback is welcome.
The problem is explained here: #22407
After digging into the code, the proposal is to make five modifications/additions to the gRPC logger. A detailed description per modification/addition is listed below.
- Set tls_sni_hostname also when TLS connection is not terminated.
- Add duration
- Add upstream_request_attempt_count
- Add connection_termination_details
- Add ja3_fingerprint when ja3_fingerprinting is enabled
Details
-
Set tls_sni_hostname also when TLS connection is not terminated. This enables it to log the requested hostname when the tls listener filter is used, but the connection is simply forwarded to the upstream cluster. This is for example useful to see if envoy forwards to the correct cluster based on the requested hostname.
-
Add duration The existing duration fields in the acceslog.proto are not present when using the tcp listener. The complete duration, based on stream_info.requestComplete(), however is available. A separate duration field is in line with the file access log.
-
Add upstream_request_attempt_count Logs the number of times the connection request is attempted upstream. Note that the field is omitted when the connect request was never attempted upstream. This is particularly useful when the max_connect_attempts field is set in the tcp_proxy.
-
Add connection_termination_details Connection termination details may provide additional information about why the connection was terminated by Envoy for L4 reasons. This is especiall usefull when network rbac envoy.filters.network.rbac . The connection termination details contains the info why a connection was blocked/allowed.
-
Add ja3_fingerprint when ja3_fingerprinting is enabled The tls inspector listner filter has an option to generate ja3 fingerprint (default false). When available this logs the ja3 fingerprint. This field is omitted when the ja3 fingerprint is not available. The field is present when eiter envoy terminates the tls connection or acts as a transparent proxy.
Risk Level: Low/Medium - four additional fields are added, no fields removed or changed (no behavior changes expected)
The only change is that tls_properties is also added when it is not a terminated tls connection. In this scenario, of the tls_properties only the tls_sni_hostname is available. The other tls_properties fields are not present. As far as I can see, this is consistent with the behavior of the tls listener filter.
This could be a breaking change if the receiving application of the access log message checks for the presence of the tls_properries and when present expect all the fields to be there, or assumes this must be a tls terminated connection.
Testing: unit tests Development build using mac os x (m1 if that info is relevant), but with ‘default’ extensions. All tests pass.
manual testing Using a sample grpc log service. Displaying accesslog stream using golang protojson.Marshal() Tested
- Tcp with tls
- Tcp no tls
- Http no tls
Docs Changes: I have added the documentation in the proto file accesslog.proto for data.accesslog.v3.AccessLogCommon, and data.accesslog.v3.TLSProperties.
Am I right in assuming that the documentation is generated from this file? If not, I have to modify the documentation.
Release Notes:
access_log: log duration
, upstream_request_attempt_count
, connection_termination_details
and tls ja3
field in the grpc access log and also log the tls sni
and ja3
field in the grpc access log when envoy is configured as a tls forward proxy.
Platform Specific Features: N/A
Issues: Fixes: #22407
[Optional API Considerations:]
ls_sni_hostname:
- See risk level above. The other option would be to add a requested_server_name field to the access_log_common properties. That would be present when either the tls connection is terminated or forwarded.
- The downside of this solution would be that this would create a duplicate field for tls_sni_hostname when envoy terminates the tls connection.
- The proposed solution: Filling tls_properties->tls_sni_hostame either when envoy terminates or forwards the tls connection is consistent with file access logging.
upstream_request_attempt_count
- In file logging this field is 0 when there is no upstream connection attempt made. I believe that omitting this field when there is no value is more in line with the grcp ALS service.
- In most scenarios the value of this field is 1. This field could also be adapted to a retry count. In that solution the field would be present only if the attempt > 1. This is however not consistent with file logging. The proposed solution is.
Edit 10 August 2022: Added release notes
Hi @niektemme, welcome and thank you for your contribution.
We will try to review your Pull Request as quickly as possible.
In the meantime, please take a look at the contribution guidelines if you have not done so already.
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/)
.
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/)
.
/wait on CI
Ah I see I made some code format/spelling mistakes. I will try to fix them to today and run check before submitting. sorry for the rookie mistake
no worries - happens all the time (including for us non-newbies) =P
Ah. I thought my local build setup ran all the integration test, but it does not. Perhaps because I use the Mac OS X build setup. The Mac OS X tests all passed ;-)
I see the grpc access log integration tests fail because it sees unexpected fields. That makes sense. I will probably be able to fix it on Friday and make sure I run all the integration tests locally before I submit again.
Impressive CI pipeline!
/wait
Seems like CI test failures are legit
//test/extensions/access_loggers/grpc:http_grpc_access_log_integration_test FAILED in 2 out of 2 in 20.2s
Stats over 2 runs: max = 20.2s, min = 20.0s, avg = 20.1s, dev = 0.1s
/build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-opt/testlogs/test/extensions/access_loggers/grpc/http_grpc_access_log_integration_test/test.log
/build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-opt/testlogs/test/extensions/access_loggers/grpc/http_grpc_access_log_integration_test/test_attempts/attempt_1.log
//test/extensions/access_loggers/grpc:tcp_grpc_access_log_integration_test FAILED in 2 out of 2 in 7.8s
The implementation makes sense. Can you check CI, and also add tests for those logging field that you added?
The implementation makes sense. Can you check CI, and also add tests for those logging field that you added?
Thanks. Yes, I have fixed the failing existing integration tests by:
- Adding the
duration
field as a field to be removed because it is not deterministic. In the same waytime_to_last_rx_byte
was already removed. - Adding
upstream_request_attempt_count: 1
to the expected output.
I am now looking at creating additional integration tests for
- TLS trafic (
tls_sni_hostname
andja3_fingerprint
fields) - RBAC (
connection_termination_details
field). -
upstream_request_attempt_count: 2
when retry = 2
This new integration tests are probably a little harder for me to do. I see the integration tests in the listener filters themselves probably give the insight in how to setup the correct configuration and client test set-up. I'll try to finish it over the weekend.
Hi,
I fixed / added the integration tests. All tests in the pipeline success except for Mac (CI?), see below.
Failing Mac step (just CI?) I see the CI is failing at the mac step.
- As far as I can see the tests were actually successful. [ PASSED ] 52 tests.
- It failed on the process test results step with:
Error in download_and_extract: java.io.IOException: Error downloading [https://github.com/indygreg/python-build-standalone/releases/download/20220227/cpython-3.10.2+20220227-x86_64-apple-darwin-install_only.tar.gz] to /Users/runner/work/1/s/_bazel_runner/3d1a7f52fd47fc9a8db71770aacd13e5/external/python3_10_x86_64-apple-darwin/temp12472564453881277793/cpython-3.10.2+20220227-x86_64-apple-darwin-install_only.tar.gz: Bytes read 14876672 but wanted 42785626
- I am correct that this is a fault in the CI not the committed code?
- It looks like I do not have the rights to redo just this step, or I do not know how.
Fixed in the existing integration test.
- Adding the
duration
field as a field to be removed because it is not deterministic. - Adding
upstream_request_attempt_count: 1
to the expected output.
Added integration tests in tcp_grpc_access_log_integegration_test
- RBAC (new test). Integration test for connection_termination_details
New test tls_grpc_access_log_integration_test
- ssl termination no ja3
- ssl termination with ja3
- no ssl termination no ja3
- no ssl termination with ja3
These run the integration test of the GRPC access logs for the desired TLS fields.
tls_version
tls_cipher_suite
tls_sni_hostname
ja3_fingerprint
local_certificate_properties
- subject_alt_name
- subject
peer_certificate_properties
- subject_alt_name
- subject
Separate test for TLS The tls access log tests are based on the integration tests of the tls listener filter. The setup is quite different from the tcp tests. That is why I separated it into a new test. As a c++ newbie it is hard to integrate the existing tcp tests with the tls tests. The different test setup could justify a separate test (file) from the tcp access log integration test.
no integration test for upstream_request_attempt_count >1 I was not able to make a good integration test for upstream_request_attempt_count >1. I could not find a way to make sure the value had a deterministic output other than 1. I also could not find an existing envoy integration test that did this.
To do / question I see I still left to lines committed out in the tcp_access_log_integration test that should be removed. Perhaps a re run will also fix the Mac ci issue. I was hoping to get feedback on this commit first before doing a re-run of the CI.
Small update
- Removed unnecessary commented out lines and replaced them with correct comments (as mentioned in previous comment).
Mac CI did run correctly this time (no code change). I see a warning with the CI of package arm -> post job cache ##[error]The server is busy. Please try again later.
I hope this is not an issue.
I am traveling between August 11 and August 21 and will probably not be able to check back (in a timely manner) during this time. Sorry if this causes an inconvenience.
Sorry, I realised my proposed code for adding TLS fields could be simplified. Changed it and reran all the tests. This looks better (no redundant code).
Another advantage is that the ja3 filed is now also filled if there is no sni (this is a possibility). I added the integration test for that scenario as well. All passes.
Thanks for getting this. Can you also add a release note into changelog?
Of course! I am however unable to find how to add the release note. The text is below. I guess I am missing it. Could you point me in the right direction on how do that?
access_log: log duration, upstream_request_attempt_count, connection_termination_details and tls ja3 field in the grpc access log and also log the tls sni and ja3 field in the grpc access log when envoy is configured as a tls forward proxy.
@niektemme add an entry to https://github.com/envoyproxy/envoy/blob/main/changelogs/current.yaml
@lizan updated changelogs/current.yaml
and accesslog.proto
for documentation.
Hi, Sorry to ask. Is there still something left for met to do? Did I miss something or do something incorrectly?
@niektemme sorry I was off for a week for medical reasons and now back. Can you merge main to resolve conflicts? Otherwise it looks great.
@lizan I hope everything is ok. I updated changelog/current.yaml from main branch. The only conflict now my addition. I can not resolve this myself right as I do not have write access to main? Or do I not understand correctly?
@niektemme you can merge main branch to your PR branch and resolve conflicts, then push that change to your PR branch.
@lizan thank you for the instructions. I did not know that was possible. I merged the main branch and pushed the pr again. The conflict is now resolved. Ci was also now successful.