Debugging with dlv inside gVisor with systrap faults
Description
Hey gVisor folks!
I have a use case where I need to debug a Go binary with dlv, running under gVisor. When I set a breakpoint, reach it, and then try to step, dlv either hangs or faults.
This only happens when running with systrap; with --platform=ptrace it works as expected. As use of the ptrace platform is discouraged, I would really appreciate your help with making it work under systrap.
It happens with the latest runsc, Go, and dlv releases.
Steps to reproduce
mkdir /tmp/gvisor-dlv
cd /tmp/gvisor-dlv
# get runsc
ARCH=$(uname -m)
wget -q https://storage.googleapis.com/gvisor/releases/release/latest/$ARCH/runsc
chmod +x runsc
export PATH=$PATH:/tmp/gvisor-dlv
# get go binary
wget -qO- https://go.dev/dl/go1.24.2.linux-$(echo $ARCH | sed -E 's/x86_/amd/;s/aarch/arm/').tar.gz | tar xzf -
export PATH=$PATH:/tmp/gvisor-dlv/go/bin
# create oci bundle
mkdir bundle
mkdir --mode=0755 bundle/rootfs
# get dlv binary
GOBIN=/tmp/gvisor-dlv/bundle/rootfs go install github.com/go-delve/delve/cmd/dlv@latest
# compile go binary
go mod init main
cat <<EOF > main.go
package main
func main() {
for i := range 1000 {
print(i)
}
}
EOF
go build -gcflags 'all=-N -l' -o bundle/rootfs/main .
# create mount dir for uds
mkdir mnt
chmod 777 mnt
# create oci config
cat <<EOF > bundle/config.json
{
"ociVersion": "1.0.0",
"process": {
"args": [
"/dlv",
"--listen=unix:/mnt/dlv.sock",
"--headless",
"--api-version=2",
"--accept-multiclient",
"exec",
"/main"
]
},
"root": {
"path": "rootfs"
},
"mounts": [
{
"destination": "/mnt",
"type": "bind",
"source": "/tmp/gvisor-dlv/mnt"
}
]
}
EOF
# run gvisor
sudo -E env PATH=$PATH runsc --host-uds=all run -bundle /tmp/gvisor-dlv/bundle test
# > API server listening at: /mnt/dlv.sock
# connect to dlv in another session
cd /tmp/gvisor-dlv
sudo chmod 666 mnt/dlv.sock
bundle/rootfs/dlv connect unix:mnt/dlv.sock
# > (dlv)
# set breakpoint
(dlv) b main.main
# > Breakpoint 1 set at 0x470b0a for main.main() ./main.go:3
# reach breakpoint
(dlv) c
# > [Breakpoint 1] main.main() ./main.go:3 (hits goroutine(1):1 total:1) (PC: 0x470b0a)
# next
(dlv) n
# dlv should now hang or fault; it might take a few runs to hit the fault flow.
# On rare occasions it works properly; if so, rerunning will usually trigger the hang / fault.
# > unexpected fault address 0xc0000547d0
# > [signal SIGSEGV: segmentation violation code=0x2 addr=0xc0000547d0 pc=0xc0000547d0]
# > [runtime-fatal-throw] runtime.throw() ./go/src/runtime/panic.go:1092 (hits goroutine(1):1 total:1) (PC: 0x468084)
runsc version
runsc version release-20250414.0
spec: 1.1.0-rc.1
uname
Linux ip-10-0-1-16 6.8.0-1024-aws #26-Ubuntu SMP Tue Feb 18 17:22:37 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
runsc debug logs
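(Not part of the original report: a sketch of how such logs can be collected, by rerunning the reproducer's run command with runsc's debug-logging flags; the log directory path is an assumption.)

```shell
# Rerun with sentry debug logging and strace output enabled;
# each runsc subcommand writes its own log file into the given directory.
mkdir -p /tmp/gvisor-dlv/logs
sudo -E env PATH=$PATH runsc --host-uds=all \
  --debug --debug-log=/tmp/gvisor-dlv/logs/ --strace \
  run -bundle /tmp/gvisor-dlv/bundle test
```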
Maybe @ayushr2 / @manninglucas? 🙏🏻
Friendly ping, maybe @XueSongTap / @nlacasse?
Hi @dany74q, you may have to compile runsc with some special flags. Can you try make dev BAZEL_OPTIONS="-c dbg --define gotags=debug" and see if that makes a difference?
@manninglucas thanks for the response!
It didn't make a difference, unfortunately. Just to clarify, I'm not trying to debug gVisor itself; I'm running the dlv binary with runsc to debug another binary on the filesystem.
Small ping, maybe @konstantin-s-bogom / @avagin?
Could you try out this patch:
diff --git a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
--- a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
+++ b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
@@ -243,6 +243,7 @@ func (app *runApp) execute(t *Task) task
if t.ptraceSinglestep {
clearSinglestep = !t.Arch().SingleStep()
t.Arch().SetSingleStep()
+ t.p.FullStateChanged()
}
t.tg.pidns.owner.mu.RUnlock()
}
@@ -253,15 +254,16 @@ func (app *runApp) execute(t *Task) task
t.accountTaskGoroutineLeave(TaskGoroutineRunningApp)
region.End()
- if clearSinglestep {
- t.Arch().ClearSingleStep()
- }
if t.hasTracer() {
if e := t.p.PullFullState(t.MemoryManager().AddressSpace(), t.Arch()); e != nil {
t.Warningf("Unable to pull a full state: %v", e)
err = e
}
}
+ if clearSinglestep {
+ t.Arch().ClearSingleStep()
+ t.p.FullStateChanged()
+ }
switch err {
case nil:
Hey @avagin!
Tried the patch and unfortunately I hit the same behavior (either fault or hang) with systrap, using the same reproducer attached above. Anything else worth checking?
I've tried a few times and have also been able to reproduce the problem.
Could you please apply this patch in addition to the previous one?
diff --git a/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go b/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
--- a/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
+++ b/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
@@ -170,17 +170,19 @@ restart:
if err != nil {
return nil, hostarch.NoAccess, err
}
- if needPatch {
- s.usertrap.PatchSyscall(ctx, ac, mm)
- }
- if !isSyscall && linux.Signal(c.signalInfo.Signo) == linux.SIGILL {
- err := s.usertrap.HandleFault(ctx, ac, mm)
- if err == usertrap.ErrFaultSyscall {
- isSyscall = true
- } else if err == usertrap.ErrFaultRestart {
- goto restart
- } else if err != nil {
- ctx.Warningf("usertrap.HandleFault failed: %v", err)
+ if false {
+ if needPatch {
+ s.usertrap.PatchSyscall(ctx, ac, mm)
+ }
+ if !isSyscall && linux.Signal(c.signalInfo.Signo) == linux.SIGILL {
+ err := s.usertrap.HandleFault(ctx, ac, mm)
+ if err == usertrap.ErrFaultSyscall {
+ isSyscall = true
+ } else if err == usertrap.ErrFaultRestart {
+ goto restart
+ } else if err != nil {
+ ctx.Warningf("usertrap.HandleFault failed: %v", err)
+ }
}
}
With this patch applied, I can no longer reproduce the issue.
Hey @avagin - appreciate it, that does indeed work.
Tinkering with it a bit, I saw that it works without the previous patch, just by skipping the PatchSyscall call while keeping the SIGILL handling below it.
It does look like an important flow to skip entirely, though; what would be an effective way to tackle this?
@avagin - Do you think the issue is within the patch logic, or should we skip patching ptrace / sigchld altogether? I can prepare a patch, but I'm not yet sure what would make the most sense.
Hey @dany74q, you can just use the new flag --systrap-disable-syscall-patching for your use case. A general solution is possible if we roll the patches back to their original state, but it will have to come later unless you want to try your hand at it. If you do look into it, the simplest approach would be to stop the entire subprocess that triggered the need to roll back, rewrite the patched syscalls back to their original state, and then continue. Ideally, we'd also expose a per-stub-thread flag to prevent further patches on the stub side (though preventing patching from just the sentry is ok).
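For anyone landing here: applied to the reproducer above, the suggested workaround is just the extra flag on the run command (a sketch, assuming a runsc build recent enough to include the flag):

```shell
# Same run command as in the reproducer, with usertrap syscall patching disabled.
sudo -E env PATH=$PATH runsc --host-uds=all --systrap-disable-syscall-patching \
  run -bundle /tmp/gvisor-dlv/bundle test
```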