pixie icon indicating copy to clipboard operation
pixie copied to clipboard

Go TLS tracing not working on Amazon Linux 2023

Open SamuraiPrinciple opened this issue 1 year ago • 4 comments

Describe the bug Go TLS tracing does not work on Amazon Linux 2023.

To Reproduce Provision EKS cluster (v1.30) with two node groups, one running AL2, one running AL2023. Deploy pixie. Run a Go (1.21 or 1.22) app (pod/deployment) that's making outbound HTTPS requests (HTTP 1.1). Go to px/http_data and try and observe said outbound traffic. Only traffic initiated by the pods running on AL2 node group would appear.

Expected behavior Be able to see traffic initiated by pods scheduled on node group running AL2023.

Logs Please attach the logs by running the following command:

./px collect-logs

App information (please complete the following information):

  • Pixie version: 0.14.10
  • K8s cluster version: EKS 1.30
  • Node Kernel version: 6.1.97-104.177.amzn2023.x86_64
  • Browser version: -

pem-AL2.log pem-al2023.log

SamuraiPrinciple avatar Aug 18 '24 20:08 SamuraiPrinciple

@SamuraiPrinciple thanks for reporting this. I'm aiming to try to reproduce the issue later this week and will report back when I have more information.

ddelnano avatar Aug 20 '24 14:08 ddelnano

Hi @ddelnano - many thanks for the quick reply, it really is appreciated. Do let me know if I could be of any help (more logs, screenshare, ...).

Take care and wish you a great day.

SamuraiPrinciple avatar Aug 20 '24 15:08 SamuraiPrinciple

This is still on my radar. I'm expecting to have time to look into this at the end of the week.

ddelnano avatar Aug 28 '24 15:08 ddelnano

That's great, thanks for the update. I've tried with the latest pixie version (0.14.11) and it's still not working.

SamuraiPrinciple avatar Aug 28 '24 15:08 SamuraiPrinciple

Hi @ddelnano hope all is well :) Wondering if you've managed to reproduce the issue? Happy to provide more details (EKS AMIs, pixie deployment config and sample go app) in case you didn't manage to get a repro...

Many thanks, damjan

SamuraiPrinciple avatar Dec 09 '24 14:12 SamuraiPrinciple

@SamuraiPrinciple sorry for missing to follow up on this! I'll be giving this some attention tomorrow or Thursday.

ddelnano avatar Dec 10 '24 19:12 ddelnano

No, no, no problem at all - just let me know if you're struggling to reproduce it, very happy to make it as easy as possible for you.

Thanks for the reply and your hard work!

SamuraiPrinciple avatar Dec 10 '24 19:12 SamuraiPrinciple

Since our existing Go tls tracing test hasn't been updated for Go 1.22 yet, I wanted to rule out any obvious incompatibility. We will eventually have that version under test, but until that is upstream I've created this branch which adds coverage for Go 1.22. The test passes on that branch, which is a good sign. My next step is to reproduce the issue on EKS w/ AL2023.

@SamuraiPrinciple while I haven't tried the EKS steps yet, can you able to share an example Go application that exhibits the problem and the compiler version?

ddelnano avatar Dec 12 '24 18:12 ddelnano

Hi, thanks for picking this up! I will provide you with the sample app later, but just as quick note - if I rotate EKS nodes so that they use AL2 AMIs, the tracing then works with everything else being exactly the same.

SamuraiPrinciple avatar Dec 12 '24 19:12 SamuraiPrinciple

No problem! Looking forward to seeing the example app as I didn't have luck with the reproduction on EKS.

ddelnano avatar Dec 12 '24 21:12 ddelnano

https://github.com/samuraiprinciple/pixie-repro

SamuraiPrinciple avatar Dec 12 '24 22:12 SamuraiPrinciple

The repo contains a simple go program that polls https://example.com as well as a Dockerfile that can be used to build an image. There is also a very simple k8s deployment manifest that runs two containers (one for HTTP 1.1 and the other for HTTP/2).

SamuraiPrinciple avatar Dec 12 '24 22:12 SamuraiPrinciple

Thanks for the quick response. Does the HTTP/2 case work on AL2? Pixie's HTTP/2 tracing should only work for GRPC and when Go binaries have debug symbols (docs).

screen

As for the HTTP 1.1 case, can you explain more on the success case (when running on AL2)? I crafted a sample application very similar to yours and didn't see any traffic for AL2 or AL2023. I didn't consider my case a reproduction since I thought it was more similar to #899, which was never fully understood.

I'll be able to continue debugging this on my end, but I'd still like to understand the AL2 case to help narrow down where the problem might be.

ddelnano avatar Dec 12 '24 22:12 ddelnano

Hi, mega-thanks for the quick reply! To be honest, we're not using HTTP/2 (i.e. we're actively disabling it - long story, but it's problematic for various reasons, unrelated to pixie), so I've only included that 'just in case'. I can confirm that on AL2 I can indeed see payloads for outbound TLS requests (HTTP 1.1).

SamuraiPrinciple avatar Dec 12 '24 22:12 SamuraiPrinciple

For your AL2 environment, can you run the PEM with --stirling_conn_trace_pid=${pid_of_working_app} and upload the PEM log to this issue?

You can add that arg to the DaemonSet even though the PID won't match on the other instances. Just make sure to grab the logs from the instance that has the application with the PID.

ddelnano avatar Dec 12 '24 22:12 ddelnano

Will get on it. In the meantime, here is what I can see for the node group running AL2

Screenshot 2024-12-12 at 22 56 15

SamuraiPrinciple avatar Dec 12 '24 22:12 SamuraiPrinciple

I'm deploying pixie using a helm chart - do you happen to know what the quickest way is to pass an argument to pem daemonset? or do I have to manually update the pem daemonset?

SamuraiPrinciple avatar Dec 12 '24 23:12 SamuraiPrinciple

vizier-pem-fvgxt.log

I think I've got it...

SamuraiPrinciple avatar Dec 12 '24 23:12 SamuraiPrinciple

For completeness, the patches config setting is probably the only way to do that and a helm example can be found here.

Thanks for the logs! That is exactly what I am looking for. Need some time to further investigate, but I believe I have the information I need for now!

ddelnano avatar Dec 12 '24 23:12 ddelnano

Ok, I've tracked down the issue. The get_goid function is broken on AL23. I made the following changes to print out all the variables and here is the output that I'm seeing.

diff --git a/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h b/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
index 02cc46a3c..3667a490b 100644
--- a/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
+++ b/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
@@ -95,12 +95,14 @@ static inline uint64_t get_goid(struct pt_regs* ctx) {
   uint64_t id = bpf_get_current_pid_tgid();
   uint32_t tgid = id >> 32;
   struct go_common_symaddrs_t* common_symaddrs = go_common_symaddrs_map.lookup(&tgid);
+  bpf_trace_printk("get_goid: tgid=%d, common_symaddrs=%p\n", tgid, common_symaddrs);
   if (common_symaddrs == NULL) {
     return 0;
   }

   // Get fsbase from `struct task_struct`.
   const struct task_struct* task_ptr = (struct task_struct*)bpf_get_current_task();
+  bpf_trace_printk("get_goid: task_ptr=%p\n", task_ptr);
   if (!task_ptr) {
     return 0;
   }
@@ -118,6 +120,7 @@ static inline uint64_t get_goid(struct pt_regs* ctx) {
   size_t g_addr;
   bpf_probe_read_user(&g_addr, sizeof(void*), (void*)(fs_base + common_symaddrs->g_addr_offset));
   bpf_probe_read_user(&goid, sizeof(void*), (void*)(g_addr + common_symaddrs->g_goid_offset));
+  bpf_trace_printk("fs_base=%p get_goid: g_addr=%p, goid=%d\n", fs_base, g_addr, goid);
   return goid;
 }

  http-requester-7557    [001] ...11 14174.901214: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
  http-requester-7557    [001] ...11 14174.901218: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
             olm-3807    [000] .N.11 14177.957893: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
             olm-3807    [000] .N.11 14177.958153: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0

bpf-printk-logs.txt

ddelnano avatar Dec 13 '24 21:12 ddelnano

@SamuraiPrinciple can you try installing the kernel-devel.x86_64 package on all AL23 hosts and see if tracing works?

I was able to verify that the issue above is because Pixie's prepackaged headers (what are used if kernel headers aren't available on host) is causing this thread lookup to fail. The fsbase is essential for locating the goid, so incorrectly reading that value breaks tracing.

Essentially AL23 has enough kernel differences (back ports, etc) compared to vanilla linux that Pixie's pre-packaged headers aren't compatible. Our docs mention that it's highly recommended to install linux headers although we haven't had a good way to surface this to end users. Unfortunately these kernel header incompatibilities manifest in strange ways, so it's usually time consuming to identify the problem.

Coincidentally I've been working on #2051 and it is getting close to making it in a release. I intend to make that check part of the Pixie cli and helm install process in addition to the px collect-logs command. That would have allowed us to catch this much sooner and rule out the headers being a problem.

ddelnano avatar Dec 14 '24 06:12 ddelnano

Hi! Many thanks for the investigation and the info. I can confirm that installing linux headers package indeed resolves the issue. The HTTP 1.1 tls connections can be traced (I have noticed that gzipped responses are not rendered in UI - "resp_body: <removed: non-text content-type>,"). Happy to close this issue.

SamuraiPrinciple avatar Dec 14 '24 10:12 SamuraiPrinciple

No problem, sorry it took some time for me to dig into this and greatly appreciate your help throughout the process!

I'm hopeful that surfacing the lack of linux headers in Pixie's diagnostic tools will help uncover and fix these problems before they become month long bugs! That tooling should be making its way in a release in the next few weeks (across vizier, operator, cli, etc).

ddelnano avatar Dec 17 '24 22:12 ddelnano

Glad I could be useful :) and many, many thanks for your help with this!

SamuraiPrinciple avatar Dec 18 '24 18:12 SamuraiPrinciple