parca-agent
Parca agent brings Oracle Kubernetes Engine worker nodes down
The subject may sound weird, but that's what I have experienced. Hope you guys can shed some light. :)
I started testing Parca in OKE (Oracle Kubernetes Engine) and found that installing the Parca agent made the worker nodes unresponsive to SSH, requiring a hard reboot. At first, I suspected the issue was with OKE or its underlying cloud infrastructure, but over time I was able to isolate it: the problem only occurs after the Parca agent is installed. I reproduced it across multiple clusters, node pools, nodes, etc.
But I still don't know what exactly causes this issue. I've scanned the system stats and logs, but they looked clean and system resources were fine. I could have missed something, as I have no idea where to look.
The Parca agent showed this log on every node it was installed on, though I'm not sure if it's relevant (in any case, the log says "please report").
level=error ts=2021-12-31T05:29:33.514162476Z caller=debuginfo.go:259 namespace=kube-system pod=coredns-94d6cc8b6-br22d container=coredns component=debuginfoextractor msg="external binutils command call failed" output="objcopy: BFD (GNU Binutils for Debian) 2.31.1 internal error, aborting at ../../bfd/elf.c:7085 in rewrite_elf_program_headerobjcopy: Please report this bug." file=/proc/16763/root/coredns
Environment
- Oracle Kubernetes Engine (Kubernetes 1.21)
- OS: Oracle Linux Server 7.9 (a cousin to RHEL 7.9)
Steps to reproduce
- kubectl create namespace parca
- kubectl apply -f https://github.com/parca-dev/parca-agent/releases/download/v0.3.0/kubernetes-manifest.yaml
- Watch the node become "not ready" and unresponsive. Sometimes it takes a few minutes, sometimes much longer, so the nodes in the cluster where the Parca agent is installed as a DaemonSet go unresponsive one by one (see the watch commands right after this list).
- hard-reboot the node
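For reference, this is roughly how the failure can be watched from a workstation; these are standard kubectl commands, and kubectl top needs the metrics-server add-on:
# watch node status flip from Ready to NotReady after the agent rolls out
kubectl get nodes -w
# spot-check node resource headroom while the nodes are still reachable (requires metrics-server)
kubectl top nodes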
cc @kakkoyun
xref: https://github.com/parca-dev/parca/issues/514
@tonychoe Thanks for reporting the issue. I'm looking into this. I would be glad if you could share more information about your current setup.
Since the log is about an external program that we use to extract debug information on the host, have you seen any zombie processes before the node went down? It's just a wild guess, but we might be leaking file descriptors.
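If you can still reach a node before it goes down, something like the following would confirm or rule that out; the process name pattern is an assumption about how the agent shows up on the host:
# any zombie processes left behind?
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'
# number of open file descriptors held by the agent process
ls /proc/$(pgrep -f parca-agent | head -n 1)/fd | wc -l
# system-wide allocated file handles vs. the limit
cat /proc/sys/fs/file-nr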
That being said, I'm looking into this in more detail.
I haven't had a chance to reproduce this. I would appreciate any additional information @tonychoe
@tonychoe We have a new release. I'd appreciate it if you could give it a try: https://github.com/parca-dev/parca-agent/releases/tag/v0.4.0
@kakkoyun sorry, I have been busy; will get back after trying v0.4.0 and/or collecting any evidence.
@kakkoyun I did some tests today and here's the summary.
- Parca agent v0.4.0 didn't make a difference.
- In total I've now tested three versions of the agent (0.4.0, 0.3.0, 0.1.0). All of them made the K8S worker nodes unresponsive within a few minutes of deployment.
- The tests were done on K8S clusters built with Oracle Linux 7.9 and 8.4. Interestingly, the node with OL 8.4 was OK for two hours, so I got excited and deployed the agent to a larger cluster with 8.4. All nodes went unresponsive. :(
- I didn't see an issue with the file descriptors. I monitored the count until I lost connection to the node, several times.
- Maybe the issue is elsewhere...?
Thanks for trying it out and for the detailed report.
It was a long shot anyway. I'll try to replicate your setup. I don't know what I don't know right now.
When you have time, maybe you could add the --log-level="debug" flag and check if you see any suspicious log messages? That'd be great.
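If it's easier than editing the manifest by hand, a JSON patch along these lines should append the flag; the namespace/name parca/parca-agent and the presence of an args array on the first container are assumptions based on the release manifest:
# append --log-level=debug to the agent container's args
kubectl -n parca patch daemonset parca-agent --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--log-level=debug"}]'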
@kakkoyun
Thank you. I've been using debug mode but couldn't find any clues. All the errors I got look normal - the agent complained about not finding symbols, etc.
I ran another test today.
(1) Installed the Parca server/agent as binaries (non-K8S) on Oracle Linux (OL) 8.4. I haven't found any issue in the past 6 hours. Maybe the OS has nothing to do with this issue? (Note: I didn't test with OL 7.9, which uses an older version of glibc than the one Parca needs.)
(2) Installed the Parca server/agent again onto K8S - all worker nodes are OL 8.4. Same outcome: the K8S nodes went unresponsive one by one.
(3) From (2), I configured the agent to profile just one app by using --pod-label-selector. It was okay for a few minutes and I could see graphs, then the node went unresponsive.
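For context on (3), the restriction is just the agent flag below; the label key/value here is only an example, not the exact one I used:
# agent flag used for experiment (3)
--pod-label-selector="app=hotrod"
# sanity check of which pods such a selector would match
kubectl get pods -A -l app=hotrod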
Thanks for taking the time to investigate further, @tonychoe.
I think something is bloating the system and taking it down. The time it takes seems related to the number of targets parca-agent attempts to profile.
I need to find a way to get my hands on an Oracle Linux based cluster to investigate further.
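If you get a chance, a background snapshot loop on a worker node might capture whatever is growing right before it goes unresponsive; the interval and output path below are just a sketch:
# run on the worker node; leaves a trail that survives the hang
while true; do
  date
  head -n 3 /proc/meminfo                    # MemTotal / MemFree / MemAvailable
  cat /proc/sys/fs/file-nr                   # allocated file handles vs. limit
  ps -eo comm,rss --sort=-rss | head -n 5    # top memory consumers
  echo ---
  sleep 10
done >> /var/tmp/node-watch.log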
That's a good theory. @kakkoyun
I was continuing experiment (3). The app I was profiling (the Jaeger hotrod demo) moved to the surviving node. I could see profiling working with the app for an hour, then, as of this writing, that node went unresponsive too. So how long it takes varies.
Experiment (1) - running the Parca agent/server on a regular non-K8S VM with OL 8.4 - has gone almost a full day with zero issues.
Experiment (4) - started running the Parca agent/server on K8S, with the agent configured to profile nothing by passing --pod-label-selector a bogus label. It's been running fine for 10 hours. No node went down.
So, yeah, looks like the issue only occurs when parca agent profiles apps on K8S.
It seems that which apps the agent profiled didn't matter: the node went down when nothing but the kube-system apps were running, and it also went down when the agent was configured to profile a specific app.
FYI - I have been running Pixie on the same environment and didn't have an issue.
...
[Edit on Jan 18, 2022] Experiment (1) and (4) are still running without facing an issue. If there is any build you want me to test, just let me know.
Hi @kakkoyun
I tested the latest Parca agent in the same Kubernetes cluster. This time the nodes stayed alive for more than a day over the weekend, but sadly this morning I found they had become unresponsive. I found no clues and have no good idea how to debug this.
- parca server - v0.10.0
- parca agent - v0.7.0
- Oracle Linux 7.9
- Oracle Kubernetes Engine - 1.21.5
Oracle Cloud offers a 30-day free trial including OKE, if you want to reproduce this in the same setup I have.
@tonychoe Thanks for the update, and I'm terribly sorry that I haven't paid enough attention to this. But it's on my list; I'll get there soonish.
At least we know that things somewhat improved between versions. There must be some sort of resource leak that causes saturation on the host.
@kakkoyun thank you.
This is the test I did after my previous post: (1) Ran parca-agent profiling just one microservice; it went fine for the past 7 days. (2) Last night, removed the label selector from parca-agent so it would profile all microservices running in the K8S (OKE) cluster. All nodes went unresponsive within an hour.
Here are the parca-agent logs at "error" level, de-duplicated by signature. They were generated after I started (2). I don't think I saw these errors during (1).
Do they give you any idea?
level=error ts=2022-04-12T07:27:03.293315929Z caller=debuginfo.go:98 component=debuginfo buildid=68395a7557305f7871497355416c45675f5855572f636e70484341707a6d37316b73744236795a79432f59644f464f6d447349446a7179577530663670772f4f58726555325035657931557331563177716b39 path=/proc/13638/root/parca msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/68395a7557305f7871497355416c45675f5855572f636e70484341707a6d37316b73744236795a79432f59644f464f6d447349446a7179577530663670772f4f58726555325035657931557331563177716b39.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:41.740645304Z caller=debuginfo.go:98 component=debuginfo buildid=e3356c2be38c62c16d37c45685a6197d2fb678c7 path=/proc/20282/root/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/e3356c2be38c62c16d37c45685a6197d2fb678c7.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:41.387837562Z caller=debuginfo.go:98 component=debuginfo buildid=54eef5ce96cf37cb175b0d93186836ca1caf470c path=/proc/29903/root/lib/x86_64-linux-gnu/libc.so.6 msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/54eef5ce96cf37cb175b0d93186836ca1caf470c.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:41.096429388Z caller=debuginfo.go:98 component=debuginfo buildid=d3583c742dd47aaa860c5ae0c0c5bdbcd2d54f61 path=/proc/20282/root/lib/x86_64-linux-gnu/libdl-2.28.so msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/d3583c742dd47aaa860c5ae0c0c5bdbcd2d54f61.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:40.522310399Z caller=debuginfo.go:98 component=debuginfo buildid=5018237bbf012b4094027fd0b96fc22a24496ea4 path=/proc/2092/root/lib/x86_64-linux-gnu/libpthread.so.0 msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/5018237bbf012b4094027fd0b96fc22a24496ea4.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:40.418574478Z caller=debuginfo.go:98 component=debuginfo buildid=5018237bbf012b4094027fd0b96fc22a24496ea4 path=/proc/2092/root/lib/x86_64-linux-gnu/libpthread.so.0 msg="failed to upload debug information" err="failed to hash file /tmp/5018237bbf012b4094027fd0b96fc22a24496ea4.debuginfo: failed to open file: open /tmp/5018237bbf012b4094027fd0b96fc22a24496ea4.debuginfo: no such file or directory"
**(MOST FREQUENT ERROR MSG)** level=error ts=2022-04-12T07:26:39.309533552Z caller=debuginfo.go:98 component=debuginfo buildid=097a02695f7e64f8519b0bf68cd3de78a3d17adc path=/proc/30903/root/usr/lib64/ld-2.17.so msg="failed to upload debug information" err="failed to hash file /tmp/097a02695f7e64f8519b0bf68cd3de78a3d17adc.debuginfo: failed to open file: open /tmp/097a02695f7e64f8519b0bf68cd3de78a3d17adc.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:34.427947364Z caller=debuginfo.go:98 component=debuginfo buildid=c979d4a9453c70c0b44dabae4d212fb5b48f81b3 path=/proc/20137/root/usr/lib64/libc-2.17.so msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/c979d4a9453c70c0b44dabae4d212fb5b48f81b3.debuginfo: no such file or directory"
level=error ts=2022-04-12T07:26:24.252931678Z caller=debuginfo.go:98 component=debuginfo buildid=48596f474f6344753477517958586169745036382f756549645875676a687a314672715079657945302f4d4f716931767a69746e7633732d47724c4c2d792f3461706a37692d707a7565433044674865643877 path=/proc/19824/root/go/bin/hotrod-linux msg="failed to upload debug information" err="failed to open temp file for debug information: open /tmp/48596f474f6344753477517958586169745036382f756549645875676a687a314672715079657945302f4d4f716931767a69746e7633732d47724c4c2d792f3461706a37692d707a7565433044674865643877.debuginfo: no such file or directory"
@kakkoyun This is the warning message I often see.
level=warn ts=2022-04-12T18:45:30.544771804Z caller=extract.go:125 component=debuginfo component=extractor msg="failed to remove temporary file" file=48596f474f6344753477517958586169745036382f756549645875676a687a314672715079657945302f4d4f716931767a69746e7633732d47724c4c2d792f3461706a37692d707a7565433044674865643877.stripped err="remove 48596f474f6344753477517958586169745036382f756549645875676a687a314672715079657945302f4d4f716931767a69746e7633732d47724c4c2d792f3461706a37692d707a7565433044674865643877.stripped: read-only file system"
It's weird: parca-agent mounted the /tmp volume as an emptyDir successfully, yet it still complains.
It complains that it can't remove the longnumber.stripped file due to a "read-only file system" (??).
I can see the file in the K8S node's filesystem, but under a different name: longname.debuginfo.stripped.
Any idea why this error is occurring?
drwxrwxrwx. 2 root root 4096 Apr 12 18:45 .
drwxr-xr-x. 3 root root 17 Apr 12 18:37 ..
-rwxr-xr-x. 1 root root 16883232 Apr 12 18:45 48596f474f6344753477517958586169745036382f756549645875676a687a314672715079657945302f4d4f716931767a69746e7633732d47724c4c2d792f3461706a37692d707a7565433044674865643877.debuginfo.stripped
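For what it's worth, this is how I'd double-check how /tmp is actually mounted inside the agent container; the pod name is a placeholder, and it assumes cat is available in the agent image:
# how /tmp is mounted inside the agent container
kubectl -n parca exec <parca-agent-pod> -- cat /proc/mounts | grep ' /tmp '
# volumeMounts declared in the DaemonSet spec
kubectl -n parca get daemonset parca-agent -o jsonpath='{.spec.template.spec.containers[0].volumeMounts}'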
@tonychoe Thanks for the detailed logs. This is something I can work with. I'll find a solution for this.
Yet, I don't think this could bring down the nodes. This is something else, and I believe I've introduced it with the latest version. I'm going to create another issue for it.
@kakkoyun I've tested the agent v0.8.0 and unfortunately the issue is still there - the K8S nodes went unresponsive quickly.
@kakkoyun Hello, I was able to replicate @tonychoe's issue here.
Environment:
- Oracle Kubernetes Engine cluster, Kubernetes v1.23.4
- Image: Oracle-Linux-8.5-2022.04.26-0-OKE-1.23.4-250
- VM shape: 8 CPUs, 128 GB
- Parca server v0.11.1
- Parca agent v0.8.1
Steps to reproduce
- Create OKE cluster with above specifications
- Followed the steps here up to and including setting the port forward for port 7071 for active profilers. The Parca server and active profiler pages worked as expected before the node crashed.
- Some subset of the nodes that the parca-agent pods run on crash, sending those nodes into NotReady state (as given by kubectl get nodes)
- All other pods on a node crashed by parca-agent get stuck in Terminating state and migrate to other nodes
- Another node crashes and this repeats until all nodes in the cluster have crashed, requiring hard rebooting each node.
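For anyone following along, the stranded pods and node conditions can be inspected like this; the node name is a placeholder:
# pods stuck on a crashed node stay in Terminating
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>
# node conditions reported by the kubelet around the crash
kubectl describe node <node-name>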
Observations
On my first attempt, using a node shape of 2 CPUs and 32 GB, it took about 45 minutes from creating the parca-agent DaemonSet for the node pool's 2 nodes to crash. On my next attempt, using a new node shape of 8 CPUs and 128 GB, the nodes crashed in less than 10 minutes. On the attempt after that, using the same 8 CPU / 128 GB shape, one node in the node pool crashed very quickly, but the remaining node has been functioning for the last 95 minutes as of this writing. Attached is the log (debug flag on) of this 95-minute-surviving node: parca-agent-hmcwh-pool3.log
We're working on stability improvements, and this is at the top of our list. @jcr179 @tonychoe Thanks for the amazing reports. And I'm terribly sorry that I still couldn't get to this (company off-site and KubeCon; it was hectic).
When I check the logs, I see many DNS failures and many errors about retrieving the kernel stack. We need to check both and understand what's different with the Oracle kernel image.
I'll ask for your further patience.
Hi @kakkoyun , I spent some more time working on this and came up with some more data points that might help us out.
Some key observations to corroborate @tonychoe's findings:
(using the same environment as in my first comment: Oracle Kubernetes Engine cluster on Kubernetes v1.23.4; image: Oracle-Linux-8.5-2022.04.26-0-OKE-1.23.4-250; VM shape: 8 CPUs, 128 GB; Parca server v0.11.1; Parca agent v0.8.1)
- it seems to be the parca agent, not the server, that's causing OKE worker nodes to crash
- OKE worker nodes crash one by one since parca-agent is deployed as a daemonset meaning one instance of a parca-agent pod is on each node
- deploying the parca agent on its own (no parca server and only the kube-system namespace pods active) didn't cause a crash for 21 hours (at the 21-hour mark, I tried deploying the parca server and one of the two nodes in my node pool crashed)
- deploying apps to profile, while the parca agent had been safely active for 21 hours, caused nodes to crash really quickly
- pod security policies being disabled/enabled in the OCI console doesn’t seem to make a difference (they are disabled by default, meaning pod security policies are not used to restrict the pods accepted into the cluster. Enabled means the pod security policy admission controller will use pod security policies to restrict the pods accepted into the cluster.)
- in the logs of the parca-agent pods, I saw many instances of level=debug ts=2022-06-03T20:43:20.304462286Z caller=write_client.go:92 msg="batch write client failed to send profiles" retry=1.045314713s count=0 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp: lookup parca.parca.svc.cluster.local on 10.96.5.5:53: read udp 10.244.1.2:58438->10.96.5.5:53: read: connection refused\"", despite the subnets of the OKE worker nodes allowing ingress/egress to/from 10.96.0.0/16
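A quick way to try the same lookup the agent is failing on, from a throwaway pod inside the cluster (the image tag is arbitrary):
kubectl run dns-test -it --rm --restart=Never --image=busybox:1.36 \
  -- nslookup parca.parca.svc.cluster.local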
Attached is a log from one of the parca-agent pods I obtained using kubectl logs -f -n parca parca-agent-cz96b > parca-agent-cz96b-failure.log where I did the following actions in order:
- Confirmed that my new worker node 10.0.27.240 is working and that I can ssh into it, left this ssh session running
- Created the parca-agent resources, with only kube-system pods also running in my cluster (no parca server and no additional apps to profile)
- Started creating this log file at about 1:43 PM Pacific time
- Created the parca server resources at about 1:47 PM Pacific time
- Created the online-boutique.yaml (also attached) resources at about 1:51 PM Pacific time
- The node crashed within the next minute (status became NotReady)
- My ssh connection to my worker node was closed
- Confirmed that I could no longer ssh into my new worker node 10.0.27.240
Currently I'm not sure if the issue lies with OKE, the image I use, or something else. I will try some other distros, starting with RHEL since Oracle Linux is based on RHEL, and report back.
Attached are the entire log file described by the above process (parca-agent-cz96b-failure.log), just the level=warn messages (parca-agent-cz96b-failure-warn.log; there were no level=error messages), and online-boutique.txt (normally named online-boutique.yaml, but GitHub didn't allow me to upload it with a .yaml extension), the resources I tried to profile, which crashed the node immediately after provisioning successfully (according to kubectl).
parca-agent-cz96b-failure.log
parca-agent-cz96b-failure-warn.log
online-boutique.txt
deploying parca agent on its own (no parca server and only the kube-system namespace pods active) didn’t cause a crash for 21 hours (at the 21 hour mark, I tried deploying parca server and one of the two nodes in my nodepool crashed)
This is the most interesting bit I believe.
cc @kakkoyun @javierhonduco this makes me think that this has something to do with object file extraction or upload (it's also possible that sending the raw data might be the culprit, but I'd be more surprised if it was that)
@kakkoyun Hi, as an update I just tried again with Parca Server v0.12.1 and Parca Agent v0.9.1 and it worked for 15 minutes before crashing a worker node... with Parca Agent v0.8 it did not work.
Thanks for trying it out. We're constantly trying to improve our resource usage. We believe the next major release (v0.10.0) will have a huge impact on the volume of resources needed to profile the host: we're moving to a single eBPF profiler for the whole host. And as I've already added this issue to the milestone, I'll be testing the release candidate on Oracle Kubernetes Engine. Please stay tuned; the release should be out in the next couple of weeks.
A candidate for the promised but belated release is here: https://github.com/parca-dev/parca-agent/releases/tag/v0.10.0-rc.0
@tonychoe @jcr179 I would appreciate it if you could give it a try.
I would highly recommend running the tests with v0.10.0-rc.1 now. It has improved significantly.
@kakkoyun Our Parca server/agent have been running smooth on OKE the past 2 weeks. Thank you for your work! 😄
Great news!
Looking forward to hearing back from @tonychoe
@kakkoyun same here, no issues found in the past two weeks. Thanks a lot for this fix. This issue can be closed now. :)