beyla icon indicating copy to clipboard operation
beyla copied to clipboard

Change how we correlate SSL to connection information

Open grcevski opened this issue 7 months ago • 2 comments

Until now, the way we correlated SSL information to a connection was based on looking for traffic while SSL_read or SSL_write were running. This sort of worked, but we recently found that it can be unstable with relation to sort interrupts. We can take a look at the following Beyla kernel trace:

           nginx-6195    [003] d... 285251.759323: bpf_trace_printk: tcp_sendmsg for identified SSL connection, ignoring...
           nginx-6195    [003] d... 285251.759324: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
           nginx-6195    [003] d... 285251.759327: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaf47c63780 ===
           nginx-6195    [003] d... 285251.759328: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=3012
           nginx-6195    [003] d... 285251.759328: bpf_trace_printk: [conn] d_h = 0, d_l = 64011aacffff0000, d_port=34418

>>> good correlation of the SSL 0x561882481e20 to the correct connection info
           nginx-6195    [003] d... 285251.759330: bpf_trace_printk: Correlating SSL 561882481e20 to connection
           nginx-6195    [003] d... 285251.759331: bpf_trace_printk: [conn] s_h = 0, s_l = 64011aacffff0000, s_port=34418
           nginx-6195    [003] d... 285251.759332: bpf_trace_printk: [conn] d_h = 0, d_l = c2141aacffff0000, d_port=3012
           nginx-6195    [003] d... 285251.759338: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===

We can notice here a start of a connection being established, followed by a send message to an unrelated port,
but it's being handled by our thread. This makes no sense, until we look at the printk flags

...d.s1.
 
s - means that we are handling a soft IRQ
1 - means that we are one level down preempted

So essentially, our work was interrupted and the thread was hijacked to handle the interrupt, causing traffic "under"
SSL_write which is unrelated.

           nginx-6195    [003] d.s1 285251.759389: bpf_trace_printk: === tcp_rcv_established id=6195 ===
           nginx-6195    [003] d.s1 285251.759400: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759400: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759410: bpf_trace_printk: === tcp_rcv_established id=6195 ===
           nginx-6195    [003] d.s1 285251.759412: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759413: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759415: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaded168000 ===
           nginx-6195    [003] d.s1 285251.759417: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759418: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443

>>> correlating to the wrong information
           nginx-6195    [003] d.s1 285251.759420: bpf_trace_printk: Correlating SSL 561882481e20 to connection
           nginx-6195    [003] d.s1 285251.759421: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759422: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759424: bpf_trace_printk: marked buffer as inactive
           nginx-6195    [003] d.s1 285251.759425: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d.s1 285251.759426: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d.s1 285251.759427: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
           nginx-6195    [003] d... 285251.759433: bpf_trace_printk: === kretprobe tcp_sendmsg=6195 sent 836===
           nginx-6195    [003] d... 285251.759439: bpf_trace_printk: === uretprobe SSL_write id=6195 args ffffb0024ccd9f38 ===

<<< SSL_write is complete, but the SSL pointer now is correlated to the wrong connection information.

           nginx-6195    [003] d... 285251.759441: bpf_trace_printk: SSL_buf id=6195 ssl=561882481e20
           nginx-6195    [003] d... 285251.759442: bpf_trace_printk: SSL conn
           nginx-6195    [003] d... 285251.759442: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d... 285251.759443: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
           nginx-6195    [003] d... 285251.759445: bpf_trace_printk: buf=[HTTP/1.1 200 OK
Server:�RG��V], pid=6195, len=807

>>> we can't find the original request for the request start and we fail to close the request.
           nginx-6195    [003] d... 285251.759446: bpf_trace_printk: No info, pid =6195?
           nginx-6195    [003] d... 285251.759448: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
           nginx-6195    [003] d... 285251.759449: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443

So, while are handling the SSL operations, we can be interrupted to handle something else on the same thread which throws a wrench in our SSL to connection tracking.

To counter that, we now do something else:

  1. We pull the file descriptor from the SSL BIO structs.
  2. We look up the thread task file descriptor table and pull the file * for the given file descriptor in SSL.
  3. We check to see if the file has a socket attached, which should always be the case for socket operations.
  4. We pull the connection information from the socket and associate it to the SSL pointer.

Testing:

  • [x] Integration tests must pass, they do a lot of SSL connection checks
  • [ ] I'll test manually with istio book shop, which has some interesting corner cases.

grcevski avatar Apr 30 '25 16:04 grcevski

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 74.41%. Comparing base (16a15f8) to head (5b8c42f). Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1875      +/-   ##
==========================================
+ Coverage   73.96%   74.41%   +0.44%     
==========================================
  Files         177      177              
  Lines       19423    19432       +9     
==========================================
+ Hits        14367    14460      +93     
+ Misses       4319     4232      -87     
- Partials      737      740       +3     
Flag Coverage Δ
integration-test 57.66% <ø> (+<0.01%) :arrow_up:
k8s-integration-test 55.92% <ø> (+0.76%) :arrow_up:
oats-test 35.87% <ø> (+0.28%) :arrow_up:
unittests 46.00% <ø> (+0.05%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Apr 30 '25 16:04 codecov-commenter

This actually didn't quite work. We can't read the SSL BIO like that for a few reasons:

  • Doesn't work with boringssl, they use a different struct.
  • Doesn't work with openssl 3.3+.

I'm working on a different approach.

grcevski avatar Apr 30 '25 23:04 grcevski