beyla
beyla copied to clipboard
Change how we correlate SSL to connection information
Until now, the way we correlated SSL information to a connection was based on looking for traffic while SSL_read or SSL_write were running. This sort of worked, but we recently found that it can be unstable with relation to sort interrupts. We can take a look at the following Beyla kernel trace:
nginx-6195 [003] d... 285251.759323: bpf_trace_printk: tcp_sendmsg for identified SSL connection, ignoring...
nginx-6195 [003] d... 285251.759324: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
nginx-6195 [003] d... 285251.759327: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaf47c63780 ===
nginx-6195 [003] d... 285251.759328: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=3012
nginx-6195 [003] d... 285251.759328: bpf_trace_printk: [conn] d_h = 0, d_l = 64011aacffff0000, d_port=34418
>>> good correlation of the SSL 0x561882481e20 to the correct connection info
nginx-6195 [003] d... 285251.759330: bpf_trace_printk: Correlating SSL 561882481e20 to connection
nginx-6195 [003] d... 285251.759331: bpf_trace_printk: [conn] s_h = 0, s_l = 64011aacffff0000, s_port=34418
nginx-6195 [003] d... 285251.759332: bpf_trace_printk: [conn] d_h = 0, d_l = c2141aacffff0000, d_port=3012
nginx-6195 [003] d... 285251.759338: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
We can notice here a start of a connection being established, followed by a send message to an unrelated port,
but it's being handled by our thread. This makes no sense, until we look at the printk flags
...d.s1.
s - means that we are handling a soft IRQ
1 - means that we are one level down preempted
So essentially, our work was interrupted and the thread was hijacked to handle the interrupt, causing traffic "under"
SSL_write which is unrelated.
nginx-6195 [003] d.s1 285251.759389: bpf_trace_printk: === tcp_rcv_established id=6195 ===
nginx-6195 [003] d.s1 285251.759400: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d.s1 285251.759400: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
nginx-6195 [003] d.s1 285251.759410: bpf_trace_printk: === tcp_rcv_established id=6195 ===
nginx-6195 [003] d.s1 285251.759412: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d.s1 285251.759413: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
nginx-6195 [003] d.s1 285251.759415: bpf_trace_printk: === kprobe tcp_rate_check_app_limited=6195 sock=ffff8eaded168000 ===
nginx-6195 [003] d.s1 285251.759417: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d.s1 285251.759418: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
>>> correlating to the wrong information
nginx-6195 [003] d.s1 285251.759420: bpf_trace_printk: Correlating SSL 561882481e20 to connection
nginx-6195 [003] d.s1 285251.759421: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d.s1 285251.759422: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
nginx-6195 [003] d.s1 285251.759424: bpf_trace_printk: marked buffer as inactive
nginx-6195 [003] d.s1 285251.759425: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d.s1 285251.759426: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
nginx-6195 [003] d.s1 285251.759427: bpf_trace_printk: === kprobe SSL tcp_sendmsg=6195 ssl=561882481e20 ===
nginx-6195 [003] d... 285251.759433: bpf_trace_printk: === kretprobe tcp_sendmsg=6195 sent 836===
nginx-6195 [003] d... 285251.759439: bpf_trace_printk: === uretprobe SSL_write id=6195 args ffffb0024ccd9f38 ===
<<< SSL_write is complete, but the SSL pointer now is correlated to the wrong connection information.
nginx-6195 [003] d... 285251.759441: bpf_trace_printk: SSL_buf id=6195 ssl=561882481e20
nginx-6195 [003] d... 285251.759442: bpf_trace_printk: SSL conn
nginx-6195 [003] d... 285251.759442: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d... 285251.759443: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
nginx-6195 [003] d... 285251.759445: bpf_trace_printk: buf=[HTTP/1.1 200 OK
Server:�RG��V], pid=6195, len=807
>>> we can't find the original request for the request start and we fail to close the request.
nginx-6195 [003] d... 285251.759446: bpf_trace_printk: No info, pid =6195?
nginx-6195 [003] d... 285251.759448: bpf_trace_printk: [conn] s_h = 0, s_l = c2141aacffff0000, s_port=40994
nginx-6195 [003] d... 285251.759449: bpf_trace_printk: [conn] d_h = 0, d_l = a8e84903ffff0000, d_port=443
So, while are handling the SSL operations, we can be interrupted to handle something else on the same thread which throws a wrench in our SSL to connection tracking.
To counter that, we now do something else:
- We pull the file descriptor from the SSL BIO structs.
- We look up the thread task file descriptor table and pull the file * for the given file descriptor in SSL.
- We check to see if the file has a socket attached, which should always be the case for socket operations.
- We pull the connection information from the socket and associate it to the SSL pointer.
Testing:
- [x] Integration tests must pass, they do a lot of SSL connection checks
- [ ] I'll test manually with istio book shop, which has some interesting corner cases.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 74.41%. Comparing base (
16a15f8) to head (5b8c42f). Report is 2 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1875 +/- ##
==========================================
+ Coverage 73.96% 74.41% +0.44%
==========================================
Files 177 177
Lines 19423 19432 +9
==========================================
+ Hits 14367 14460 +93
+ Misses 4319 4232 -87
- Partials 737 740 +3
| Flag | Coverage Δ | |
|---|---|---|
| integration-test | 57.66% <ø> (+<0.01%) |
:arrow_up: |
| k8s-integration-test | 55.92% <ø> (+0.76%) |
:arrow_up: |
| oats-test | 35.87% <ø> (+0.28%) |
:arrow_up: |
| unittests | 46.00% <ø> (+0.05%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
- :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.
This actually didn't quite work. We can't read the SSL BIO like that for a few reasons:
- Doesn't work with boringssl, they use a different struct.
- Doesn't work with openssl 3.3+.
I'm working on a different approach.