envoy icon indicating copy to clipboard operation
envoy copied to clipboard

access logging: gRPC logger improvements for Envoy as TLS transparent proxy

Open niektemme opened this issue 2 years ago • 12 comments

Signed-off-by: Niek Temme [email protected]

Commit Message: access logging: gRPC logger add sni hostname and add fields for Envoy as TLS transparent proxy

Additional Description: Hi, I would like to suggest five changes/additions to gRPC access loger and accesslog proto file. These changes mainly benefit using Envoy as a TLS transparent proxy. I am not an experienced c++ developer and any feedback is welcome.

The problem is explained here: #22407

After digging into the code, the proposal is to make five modifications/additions to the gRPC logger. A detailed description per modification/addition is listed below.

  1. Set tls_sni_hostname also when TLS connection is not terminated.
  2. Add duration
  3. Add upstream_request_attempt_count
  4. Add connection_termination_details
  5. Add ja3_fingerprint when ja3_fingerprinting is enabled

Details

  1. Set tls_sni_hostname also when TLS connection is not terminated. This enables it to log the requested hostname when the tls listener filter is used, but the connection is simply forwarded to the upstream cluster. This is for example useful to see if envoy forwards to the correct cluster based on the requested hostname.

  2. Add duration The existing duration fields in the acceslog.proto are not present when using the tcp listener. The complete duration, based on stream_info.requestComplete(), however is available. A separate duration field is in line with the file access log.

  3. Add upstream_request_attempt_count Logs the number of times the connection request is attempted upstream. Note that the field is omitted when the connect request was never attempted upstream. This is particularly useful when the max_connect_attempts field is set in the tcp_proxy.

  4. Add connection_termination_details Connection termination details may provide additional information about why the connection was terminated by Envoy for L4 reasons. This is especiall usefull when network rbac envoy.filters.network.rbac . The connection termination details contains the info why a connection was blocked/allowed.

  5. Add ja3_fingerprint when ja3_fingerprinting is enabled The tls inspector listner filter has an option to generate ja3 fingerprint (default false). When available this logs the ja3 fingerprint. This field is omitted when the ja3 fingerprint is not available. The field is present when eiter envoy terminates the tls connection or acts as a transparent proxy.

Risk Level: Low/Medium - four additional fields are added, no fields removed or changed (no behavior changes expected)

The only change is that tls_properties is also added when it is not a terminated tls connection. In this scenario, of the tls_properties only the tls_sni_hostname is available. The other tls_properties fields are not present. As far as I can see, this is consistent with the behavior of the tls listener filter.

This could be a breaking change if the receiving application of the access log message checks for the presence of the tls_properries and when present expect all the fields to be there, or assumes this must be a tls terminated connection.

Testing: unit tests Development build using mac os x (m1 if that info is relevant), but with ‘default’ extensions. All tests pass.

manual testing Using a sample grpc log service. Displaying accesslog stream using golang protojson.Marshal() Tested

  • Tcp with tls
  • Tcp no tls
  • Http no tls

Docs Changes: I have added the documentation in the proto file accesslog.proto for data.accesslog.v3.AccessLogCommon, and data.accesslog.v3.TLSProperties.

Am I right in assuming that the documentation is generated from this file? If not, I have to modify the documentation.

Release Notes: access_log: log duration, upstream_request_attempt_count, connection_termination_details and tls ja3 field in the grpc access log and also log the tls sni and ja3 field in the grpc access log when envoy is configured as a tls forward proxy.

Platform Specific Features: N/A

Issues: Fixes: #22407

[Optional API Considerations:]

ls_sni_hostname:

  • See risk level above. The other option would be to add a requested_server_name field to the access_log_common properties. That would be present when either the tls connection is terminated or forwarded.
  • The downside of this solution would be that this would create a duplicate field for tls_sni_hostname when envoy terminates the tls connection.
  • The proposed solution: Filling tls_properties->tls_sni_hostame either when envoy terminates or forwards the tls connection is consistent with file access logging.

upstream_request_attempt_count

  • In file logging this field is 0 when there is no upstream connection attempt made. I believe that omitting this field when there is no value is more in line with the grcp ALS service.
  • In most scenarios the value of this field is 1. This field could also be adapted to a retry count. In that solution the field would be present only if the attempt > 1. This is however not consistent with file logging. The proposed solution is.

Edit 10 August 2022: Added release notes

niektemme avatar Aug 02 '22 18:08 niektemme

Hi @niektemme, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/22523 was opened by niektemme.

see: more, trace.

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/). envoyproxy/api-shepherds assignee is @markdroth CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

:cat:

Caused by: https://github.com/envoyproxy/envoy/pull/22523 was opened by niektemme.

see: more, trace.

/wait on CI

alyssawilk avatar Aug 02 '22 20:08 alyssawilk

Ah I see I made some code format/spelling mistakes. I will try to fix them to today and run check before submitting. sorry for the rookie mistake

niektemme avatar Aug 03 '22 10:08 niektemme

no worries - happens all the time (including for us non-newbies) =P

alyssawilk avatar Aug 03 '22 13:08 alyssawilk

Ah. I thought my local build setup ran all the integration test, but it does not. Perhaps because I use the Mac OS X build setup. The Mac OS X tests all passed ;-)

I see the grpc access log integration tests fail because it sees unexpected fields. That makes sense. I will probably be able to fix it on Friday and make sure I run all the integration tests locally before I submit again.

Impressive CI pipeline!

niektemme avatar Aug 03 '22 20:08 niektemme

/wait

Seems like CI test failures are legit

//test/extensions/access_loggers/grpc:http_grpc_access_log_integration_test FAILED in 2 out of 2 in 20.2s
  Stats over 2 runs: max = 20.2s, min = 20.0s, avg = 20.1s, dev = 0.1s
  /build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-opt/testlogs/test/extensions/access_loggers/grpc/http_grpc_access_log_integration_test/test.log
  /build/tmp/_bazel_envoybuild/b570b5ccd0454dc9af9f65ab1833764d/execroot/envoy/bazel-out/k8-opt/testlogs/test/extensions/access_loggers/grpc/http_grpc_access_log_integration_test/test_attempts/attempt_1.log
//test/extensions/access_loggers/grpc:tcp_grpc_access_log_integration_test FAILED in 2 out of 2 in 7.8s

KBaichoo avatar Aug 03 '22 20:08 KBaichoo

The implementation makes sense. Can you check CI, and also add tests for those logging field that you added?

lizan avatar Aug 03 '22 23:08 lizan

The implementation makes sense. Can you check CI, and also add tests for those logging field that you added?

Thanks. Yes, I have fixed the failing existing integration tests by:

  1. Adding the duration field as a field to be removed because it is not deterministic. In the same way time_to_last_rx_byte was already removed.
  2. Adding upstream_request_attempt_count: 1 to the expected output.

I am now looking at creating additional integration tests for

  1. TLS trafic (tls_sni_hostname and ja3_fingerprint fields)
  2. RBAC (connection_termination_details field).
  3. upstream_request_attempt_count: 2 when retry = 2

This new integration tests are probably a little harder for me to do. I see the integration tests in the listener filters themselves probably give the insight in how to setup the correct configuration and client test set-up. I'll try to finish it over the weekend.

niektemme avatar Aug 04 '22 15:08 niektemme

Hi,

I fixed / added the integration tests. All tests in the pipeline success except for Mac (CI?), see below.

Failing Mac step (just CI?) I see the CI is failing at the mac step.

  • As far as I can see the tests were actually successful. [ PASSED ] 52 tests.
  • It failed on the process test results step with: Error in download_and_extract: java.io.IOException: Error downloading [https://github.com/indygreg/python-build-standalone/releases/download/20220227/cpython-3.10.2+20220227-x86_64-apple-darwin-install_only.tar.gz] to /Users/runner/work/1/s/_bazel_runner/3d1a7f52fd47fc9a8db71770aacd13e5/external/python3_10_x86_64-apple-darwin/temp12472564453881277793/cpython-3.10.2+20220227-x86_64-apple-darwin-install_only.tar.gz: Bytes read 14876672 but wanted 42785626
  • I am correct that this is a fault in the CI not the committed code?
  • It looks like I do not have the rights to redo just this step, or I do not know how.

Fixed in the existing integration test.

  • Adding the duration field as a field to be removed because it is not deterministic.
  • Adding upstream_request_attempt_count: 1 to the expected output.

Added integration tests in tcp_grpc_access_log_integegration_test

  • RBAC (new test). Integration test for connection_termination_details

New test tls_grpc_access_log_integration_test

  • ssl termination no ja3
  • ssl termination with ja3
  • no ssl termination no ja3
  • no ssl termination with ja3

These run the integration test of the GRPC access logs for the desired TLS fields.

tls_version
tls_cipher_suite
tls_sni_hostname
ja3_fingerprint      
local_certificate_properties
- subject_alt_name
- subject
peer_certificate_properties
- subject_alt_name
- subject

Separate test for TLS The tls access log tests are based on the integration tests of the tls listener filter. The setup is quite different from the tcp tests. That is why I separated it into a new test. As a c++ newbie it is hard to integrate the existing tcp tests with the tls tests. The different test setup could justify a separate test (file) from the tcp access log integration test.

no integration test for upstream_request_attempt_count >1 I was not able to make a good integration test for upstream_request_attempt_count >1. I could not find a way to make sure the value had a deterministic output other than 1. I also could not find an existing envoy integration test that did this.

To do / question I see I still left to lines committed out in the tcp_access_log_integration test that should be removed. Perhaps a re run will also fix the Mac ci issue. I was hoping to get feedback on this commit first before doing a re-run of the CI.

niektemme avatar Aug 08 '22 15:08 niektemme

Small update

  • Removed unnecessary commented out lines and replaced them with correct comments (as mentioned in previous comment).

Mac CI did run correctly this time (no code change). I see a warning with the CI of package arm -> post job cache ##[error]The server is busy. Please try again later. I hope this is not an issue.

I am traveling between August 11 and August 21 and will probably not be able to check back (in a timely manner) during this time. Sorry if this causes an inconvenience.

niektemme avatar Aug 09 '22 10:08 niektemme

Sorry, I realised my proposed code for adding TLS fields could be simplified. Changed it and reran all the tests. This looks better (no redundant code).

Another advantage is that the ja3 filed is now also filled if there is no sni (this is a possibility). I added the integration test for that scenario as well. All passes.

niektemme avatar Aug 10 '22 14:08 niektemme

Thanks for getting this. Can you also add a release note into changelog?

Of course! I am however unable to find how to add the release note. The text is below. I guess I am missing it. Could you point me in the right direction on how do that?

access_log: log duration, upstream_request_attempt_count, connection_termination_details and tls ja3 field in the grpc access log and also log the tls sni and ja3 field in the grpc access log when envoy is configured as a tls forward proxy.

niektemme avatar Sep 01 '22 14:09 niektemme

@niektemme add an entry to https://github.com/envoyproxy/envoy/blob/main/changelogs/current.yaml

lizan avatar Sep 01 '22 16:09 lizan

@lizan updated changelogs/current.yaml and accesslog.proto for documentation.

niektemme avatar Sep 03 '22 08:09 niektemme

Hi, Sorry to ask. Is there still something left for met to do? Did I miss something or do something incorrectly?

niektemme avatar Sep 15 '22 06:09 niektemme

@niektemme sorry I was off for a week for medical reasons and now back. Can you merge main to resolve conflicts? Otherwise it looks great.

lizan avatar Sep 16 '22 06:09 lizan

@lizan I hope everything is ok. I updated changelog/current.yaml from main branch. The only conflict now my addition. I can not resolve this myself right as I do not have write access to main? Or do I not understand correctly?

niektemme avatar Sep 16 '22 15:09 niektemme

@niektemme you can merge main branch to your PR branch and resolve conflicts, then push that change to your PR branch.

lizan avatar Sep 16 '22 15:09 lizan

@lizan thank you for the instructions. I did not know that was possible. I merged the main branch and pushed the pr again. The conflict is now resolved. Ci was also now successful.

niektemme avatar Sep 17 '22 20:09 niektemme