Pixie Misses Events from Short-Lived Processes/Pods

Open • tewaro opened this issue 2 years ago • 1 comment

Describe the bug: In many cases, Pixie fails to report HTTP events for short-lived processes, including short-lived Kubernetes Pods and short-lived child processes within longer-lived Pods. This happens because Pixie's data collector, Stirling, fails to collect the necessary process metadata. This metadata is used to reconcile communication traces with their source and destination pod(s).

Currently, each data-capture eBPF probe reports the process id (pid) of a running process to Stirling. To associate this pid with a Kubernetes pod, a lookup procedure in Stirling's userspace periodically probes Linux's sysfs to collect all pids in each cgroup, along with that cgroup's metadata. Given a pid, the procedure finds the cgroup containing that process. The cgroup path name encodes the pod's unique id (pod uid), so the two can be matched using a regex (code here). However, the pid of a short-lived process leaves the cgroup as soon as the process terminates. If the process terminates before a lookup occurs, the pid can no longer be mapped to a pod, and the data for that pid cannot be exported from Stirling.
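The lookup described above can be sketched roughly as follows. This is a minimal illustration, not Stirling's code: the regex is a simplified stand-in for the one Stirling uses, the function names are hypothetical, and the cgroup snapshot is an in-memory dictionary rather than a real scan of sysfs. The example cgroup path follows the common systemd cgroup driver layout, where the pod uid appears with underscores instead of dashes.

```python
import re
from typing import Dict, List, Optional

# Simplified, hypothetical pattern (NOT Stirling's actual regex).
# Matches "pod<uid>" where the uid uses either '-' or '_' as separator.
POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup_path(path: str) -> Optional[str]:
    """Extract the pod uid from a cgroup path, or None if there is none."""
    match = POD_UID_RE.search(path)
    if match is None:
        return None
    # Normalize the systemd-style underscores back to dashes.
    return match.group(1).replace("_", "-")


def build_pid_to_pod(snapshot: Dict[str, List[int]]) -> Dict[int, str]:
    """Build a pid -> pod uid map from one scan (cgroup path -> pids)."""
    pid_to_pod: Dict[int, str] = {}
    for cgroup_path, pids in snapshot.items():
        uid = pod_uid_from_cgroup_path(cgroup_path)
        if uid is None:
            continue  # Not a pod cgroup.
        for pid in pids:
            pid_to_pod[pid] = uid
    return pid_to_pod


# Assumed example: pid 100 is a long-lived server. Pid 200 was a
# short-lived process in the same pod that exited before this scan ran,
# so it no longer appears in the cgroup's pid list.
scan = {
    "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod0a1b2c3d_1111_2222_3333_444455556666.slice/"
    "cri-containerd-abc.scope": [100],
}
pid_to_pod = build_pid_to_pod(scan)

long_lived = pid_to_pod.get(100)   # resolves to the pod uid
short_lived = pid_to_pod.get(200)  # None: events for pid 200 have no pod
```

In this model, any event the eBPF probes report for pid 200 arrives with a pid that the periodic scan never saw, which is the gap the issue describes.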

To Reproduce: Steps to reproduce the behavior:

  1. Deploy Pixie.
  2. Start a container and ssh into it.
  3. Run curl example.com.
  4. Check the Pixie dashboard. The request does not appear in the output of the http_data script.
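The timing problem behind these steps can be observed locally without Pixie. The sketch below is only an illustration under stated assumptions: it uses /proc/<pid>/cgroup as a stand-in for Stirling's cgroup scan, the helper names are hypothetical, and the child process simply exits instead of running curl. It shows that once a short-lived process has exited, a later lookup of its cgroup finds nothing.

```python
import subprocess
import sys
from typing import Optional


def read_cgroup(pid: int) -> Optional[str]:
    """Return the contents of /proc/<pid>/cgroup, or None if unreadable.

    A missing file means the process is gone (or /proc is unavailable).
    """
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            return f.read()
    except OSError:
        return None


def run_short_lived_child() -> int:
    """Start a child that exits immediately, reap it, and return its pid."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    return child.pid


# A lookup that happens after the child has exited, as a periodic scan
# typically would for a process that lives only a few milliseconds.
exited_pid = run_short_lived_child()
late_lookup = read_cgroup(exited_pid)  # None: nothing left to map to a pod
```

A command such as the curl in step 3 behaves like this child: it can start, send its request, and exit entirely between two scans.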

Expected behavior: The behavior of short-running processes should be captured. There are several ways to achieve this, but this is the desired high-level behavior.

App information: This issue has been discussed with the Pixie team and affects all current versions of Pixie.

Additional Information: This issue is known to the Pixie team, and I am working with them to fix it. Here is a link to the evolving design document.

tewaro, Jul 26 '23 22:07

Please see this Pixie branch, which includes a mechanism to deploy a test that triggers the bug.

tewaro, Aug 01 '23 22:08