fossilize-replay run by steam is not captured by execsnoop in amd64 architecture

Open taoky opened this issue 1 year ago • 7 comments

fossilize-replay is used by Steam to pre-cache games' shaders. It uses all cores and is very CPU-intensive, which makes the desktop very slow. system76-scheduler is supposed to drop its nice and I/O priority to the lowest level. However, when games are started from Steam, fossilize-replay ends up with a nice value of 14 (its default, maybe?) instead of 19.

After some debugging with strace, it seems that (unfortunately) Steam runs in 32-bit mode, and execsnoop cannot capture a 32-bit app's execve() without code modification. In execve_fnname = b.get_syscall_fnname("execve"), execsnoop gets __x64_sys_ as the prefix by default, so __ia32_compat_sys_ is not covered.

It looks like it would be a lot of trouble to get bcc upstream to cover every architecture's symbols (they have a lot of tool scripts), so maybe it is necessary to fork execsnoop inside system76-scheduler and make some modifications?

MRE:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    if (execlp("vim", "vim", (char*) NULL) == -1) {
        perror("Failed to start vim");
        exit(EXIT_FAILURE);
    }
    return 0; // Unreachable, execlp replaces the process image if successful
}

Compile it with gcc -m32 ./example.c -o example and run it: execsnoop does not catch this exec.

If execsnoop is modified to use execve_fnname = "__ia32_compat_sys_execve", it does capture this exec. (I haven't tested this with Steam's fossilize-replay, though.)
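One way to cover both entry points without hard-coding a single prefix is to check which of the two symbols the running kernel actually exports. The sketch below is only an illustration: the helper name `execve_probe_symbols` and the idea of scanning a /proc/kallsyms dump are assumptions, not part of execsnoop or bcc. Only the two prefixes come from the discussion above.

```python
# Sketch only: the helper name and the kallsyms scan are assumptions made
# for illustration; they are not part of execsnoop or bcc.

# Prefixes mentioned in this issue: the native x86-64 entry point and the
# 32-bit compat entry point.
EXECVE_PREFIXES = ("__x64_sys_", "__ia32_compat_sys_")


def execve_probe_symbols(kallsyms_text, syscall="execve"):
    """Return the syscall entry symbols present in a /proc/kallsyms dump."""
    wanted = {prefix + syscall for prefix in EXECVE_PREFIXES}
    found = set()
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line looks like "<address> <type> <symbol> [module]".
        if len(fields) >= 3 and fields[2] in wanted:
            found.add(fields[2])
    # Keep a stable order: native entry point first, compat second.
    return [p + syscall for p in EXECVE_PREFIXES if p + syscall in found]
```

Each returned name could then be attached as a separate kprobe in a modified execsnoop, with the text of /proc/kallsyms as input. Whether the arguments decode correctly for the compat entry point is a separate question that this sketch does not address.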

taoky avatar Jul 28 '23 08:07 taoky

Related issue: https://github.com/iovisor/bcc/issues/3668

taoky avatar Jul 28 '23 08:07 taoky

The long term goal is to replace execsnoop with a custom implementation optimized for system76-power. Which would preferably also be written in Rust. There are some bpf/bcc crates available on Crates.io, but I'm not currently familiar with how it all glues together yet, or what the recommended crates are today. If you're interested, you could help me with that.

mmstick avatar Jul 28 '23 10:07 mmstick

> The long term goal is to replace execsnoop with a custom implementation optimized for system76-power. Which would preferably also be written in Rust. There are some bpf/bcc crates available on Crates.io, but I'm not currently familiar with how it all glues together yet, or what the recommended crates are today. If you're interested, you could help me with that.

This sounds really nice! I could try working on a rust bpf PoC when I have some time.

By the way (I can create a new issue if necessary), I noticed that after receiving exec info from execsnoop, the scheduler waits 2 seconds for the latest cgroup info. I think this could be done better by having a new BPF program hook cgroup_procs_write() to catch every PID written to cgroup.procs. This may help in scenarios where CPU-intensive processes are created very frequently.

taoky avatar Jul 28 '23 11:07 taoky

That would be helpful. The two-second delay is there because the scheduler would otherwise already be parsing the new process's data before its cgroup and other data are assigned to it.

mmstick avatar Jul 28 '23 12:07 mmstick

Aya seems to be the most popular Rust-idiomatic library for writing eBPF kernel and user-space programs. Today I took some time to write a working PoC with Aya.

There are some problems with Aya though:

  • Aya doesn't have very good documentation on kprobing syscalls. For example, it took me a very long time to find out how to get argument (register) values from ProbeContext.
  • The PoC is built on the template Aya provides, which seems to be designed for applications, not libraries.
  • For 32-bit apps on x86-64, Aya's built-in PtRegs cannot be used, because these apps use a different calling convention from x86-64. I'm using some non-portable, dirty methods to get the correct arguments.
  • Aya is said to support CO-RE (compile once, run everywhere), but I have only tested my PoC on Arch Linux, and I'm not sure what will happen on older kernels.
  • The Aya template uses the git repo directly as a dependency in its Cargo.toml (like aya-bpf = { git = "https://github.com/aya-rs/aya" }) without pinning any commit, branch, or tag.
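The calling-convention mismatch in the third point can be shown with a small register-selection sketch. It assumes the usual Linux conventions: native x86-64 syscalls pass arguments in rdi, rsi, rdx, r10, r8, r9, while 32-bit syscalls pass them in ebx, ecx, edx, esi, edi, ebp. The function name and the dict standing in for pt_regs are hypothetical, and this is not how the PoC or Aya does it.

```python
# Sketch only: `syscall_arg` and the dict-based stand-in for pt_regs are
# hypothetical; field names follow the x86-64 pt_regs naming (di, si, ...).

# Argument registers for native x86-64 syscalls, in argument order.
X86_64_SYSCALL_ARG_REGS = ("di", "si", "dx", "r10", "r8", "r9")
# Argument registers for 32-bit (ia32) syscalls, in argument order.
IA32_SYSCALL_ARG_REGS = ("bx", "cx", "dx", "si", "di", "bp")


def syscall_arg(regs, index, compat):
    """Return syscall argument `index` from a mapping of saved registers."""
    order = IA32_SYSCALL_ARG_REGS if compat else X86_64_SYSCALL_ARG_REGS
    value = regs[order[index]]
    if compat:
        # A 32-bit caller only supplies the low 32 bits of each register.
        value &= 0xFFFFFFFF
    return value
```

For execve, the filename pointer is argument 0, so under these assumptions a probe on the compat entry point would read it from bx rather than di.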

taoky avatar Aug 01 '23 17:08 taoky

It may be worth asking Aya's developers for help with the less documented or tricky areas. Perhaps they'd be willing to accept contributions for improvements to their documentation and APIs?

For now, I'd accept a solution that can at least match parity with execsnoop-bpfcc. Though we can choose a better medium for communicating the inputs. Perhaps a zero-copy serializer like https://github.com/rkyv/rkyv, or something human-readable that's efficient to serialize and deserialize like https://kdl.dev/.

I tried running your proof of concept, and got this error:

failed to initialize eBPF logger: log event array AYA_LOGS doesn't exist

mmstick avatar Aug 03 '23 00:08 mmstick

> For now, I'd accept a solution that can at least match parity with execsnoop-bpfcc. Though we can choose a better medium for communicating the inputs. Perhaps a zero-copy serializer like https://github.com/rkyv/rkyv, or something human-readable that's efficient to serialize and deserialize like https://kdl.dev/.

Continuing to use bcc + Python and just adjusting the script's output also sounds OK, and it doesn't involve a lot of refactoring.

> I tried running your proof of concept, and got this error:
>
> failed to initialize eBPF logger: log event array AYA_LOGS doesn't exist

This is expected, as there are no Aya logger calls (info!(), etc.) in the eBPF programs. However, I tested it today in a Debian 12 VM and unfortunately found that the "portability" Aya claims does not seem to work. execsnoop-poc-ebpf/src/vmlinux.rs was generated under Linux 6.4.x, and the program fails when accessing task->real_parent->tgid on Linux 6.1.x. After regenerating it on Linux 6.1.x, it works.

taoky avatar Aug 03 '23 09:08 taoky