Debugging with dlv inside gVisor with systrap faults
Description
Hey gVisor folks!
I have a use case where I need to debug a Go binary with dlv, running under gVisor. When I set a breakpoint, reach it, and then try to step, dlv either hangs or faults.
This only happens when running with systrap; with --platform=ptrace it works as expected. As use of the ptrace platform is discouraged, I would really appreciate your help with making it work under systrap.
It happens with the latest runsc, Go, and dlv releases.
Steps to reproduce
mkdir /tmp/gvisor-dlv
cd /tmp/gvisor-dlv
# get runsc
ARCH=$(uname -m)
wget -q https://storage.googleapis.com/gvisor/releases/release/latest/$ARCH/runsc
chmod +x runsc
export PATH=$PATH:/tmp/gvisor-dlv
# get go binary
wget -qO- https://go.dev/dl/go1.24.2.linux-$(echo $ARCH | sed -E 's/x86_/amd/;s/aarch/arm/').tar.gz | tar xzf -
export PATH=$PATH:/tmp/gvisor-dlv/go/bin
# create oci bundle
mkdir bundle
mkdir --mode=0755 bundle/rootfs
# get dlv binary
GOBIN=/tmp/gvisor-dlv/bundle/rootfs go install github.com/go-delve/delve/cmd/dlv@latest
# compile go binary
go mod init main
cat <<EOF > main.go
package main
func main() {
for i := range 1000 {
print(i)
}
}
EOF
go build -gcflags 'all=-N -l' -o bundle/rootfs/main .
# create mount dir for uds
mkdir mnt
chmod 777 mnt
# create oci config
cat <<EOF > bundle/config.json
{
"ociVersion": "1.0.0",
"process": {
"args": [
"/dlv",
"--listen=unix:/mnt/dlv.sock",
"--headless",
"--api-version=2",
"--accept-multiclient",
"exec",
"/main"
]
},
"root": {
"path": "rootfs"
},
"mounts": [
{
"destination": "/mnt",
"type": "bind",
"source": "/tmp/gvisor-dlv/mnt"
}
]
}
EOF
# run gvisor
sudo -E env PATH=$PATH runsc --host-uds=all run -bundle /tmp/gvisor-dlv/bundle test
# > API server listening at: /mnt/dlv.sock
# connect to dlv in another session
cd /tmp/gvisor-dlv
sudo chmod 666 mnt/dlv.sock
bundle/rootfs/dlv connect unix:mnt/dlv.sock
# > (dlv)
# set breakpoint
(dlv) b main.main
# > Breakpoint 1 set at 0x470b0a for main.main() ./main.go:3
# reach breakpoint
(dlv) c
# > [Breakpoint 1] main.main() ./main.go:3 (hits goroutine(1):1 total:1) (PC: 0x470b0a)
# next
(dlv) n
# dlv should now hang or fault; it might take a few runs to hit the fault flow.
# On rare occasions it works properly; if so, rerunning will usually trigger the hang / fault.
# > unexpected fault address 0xc0000547d0
# > [signal SIGSEGV: segmentation violation code=0x2 addr=0xc0000547d0 pc=0xc0000547d0]
# > [runtime-fatal-throw] runtime.throw() ./go/src/runtime/panic.go:1092 (hits goroutine(1):1 total:1) (PC: 0x468084)
runsc version
runsc version release-20250414.0
spec: 1.1.0-rc.1
uname
Linux ip-10-0-1-16 6.8.0-1024-aws #26-Ubuntu SMP Tue Feb 18 17:22:37 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
runsc debug logs
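(Not part of the original report: a sketch of how such logs can be collected, by rerunning the reproducer's run command with runsc's debug-logging flags; the log directory path is an assumption.)

```shell
# Rerun with sentry debug logging and strace output enabled;
# each runsc subcommand writes its own log file into the given directory.
mkdir -p /tmp/gvisor-dlv/logs
sudo -E env PATH=$PATH runsc --host-uds=all \
  --debug --debug-log=/tmp/gvisor-dlv/logs/ --strace \
  run -bundle /tmp/gvisor-dlv/bundle test
```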
Maybe @ayushr2 / @manninglucas? 🙏🏻
Friendly ping, maybe @XueSongTap / @nlacasse?
Hi @dany74q, you may have to compile runsc with some special flags. Can you try make dev BAZEL_OPTIONS="-c dbg --define gotags=debug" and see if that makes a difference?
@manninglucas thanks for the response!
It didn't make a difference, unfortunately. Just to clarify, I'm not trying to debug gVisor itself; I'm running the dlv binary with runsc to debug another binary on the filesystem.
Small ping, maybe @konstantin-s-bogom / @avagin?
Could you try out this patch:
diff --git a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
--- a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
+++ b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
@@ -243,6 +243,7 @@ func (app *runApp) execute(t *Task) task
if t.ptraceSinglestep {
clearSinglestep = !t.Arch().SingleStep()
t.Arch().SetSingleStep()
+ t.p.FullStateChanged()
}
t.tg.pidns.owner.mu.RUnlock()
}
@@ -253,15 +254,16 @@ func (app *runApp) execute(t *Task) task
t.accountTaskGoroutineLeave(TaskGoroutineRunningApp)
region.End()
- if clearSinglestep {
- t.Arch().ClearSingleStep()
- }
if t.hasTracer() {
if e := t.p.PullFullState(t.MemoryManager().AddressSpace(), t.Arch()); e != nil {
t.Warningf("Unable to pull a full state: %v", e)
err = e
}
}
+ if clearSinglestep {
+ t.Arch().ClearSingleStep()
+ t.p.FullStateChanged()
+ }
switch err {
case nil:
Hey @avagin!
Tried the patch and unfortunately I hit the same behavior (either fault or hang) with systrap, using the same reproducer attached above. Anything else worth checking?
I've tried a few times and have also been able to reproduce the problem.
Could you please apply this patch in addition to the previous one?
diff --git a/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go b/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
--- a/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
+++ b/google3/third_party/gvisor/pkg/sentry/platform/systrap/systrap.go
@@ -170,17 +170,19 @@ restart:
if err != nil {
return nil, hostarch.NoAccess, err
}
- if needPatch {
- s.usertrap.PatchSyscall(ctx, ac, mm)
- }
- if !isSyscall && linux.Signal(c.signalInfo.Signo) == linux.SIGILL {
- err := s.usertrap.HandleFault(ctx, ac, mm)
- if err == usertrap.ErrFaultSyscall {
- isSyscall = true
- } else if err == usertrap.ErrFaultRestart {
- goto restart
- } else if err != nil {
- ctx.Warningf("usertrap.HandleFault failed: %v", err)
+ if false {
+ if needPatch {
+ s.usertrap.PatchSyscall(ctx, ac, mm)
+ }
+ if !isSyscall && linux.Signal(c.signalInfo.Signo) == linux.SIGILL {
+ err := s.usertrap.HandleFault(ctx, ac, mm)
+ if err == usertrap.ErrFaultSyscall {
+ isSyscall = true
+ } else if err == usertrap.ErrFaultRestart {
+ goto restart
+ } else if err != nil {
+ ctx.Warningf("usertrap.HandleFault failed: %v", err)
+ }
}
}
With this patch applied, I can no longer reproduce the issue.
Hey @avagin - appreciate it, that does indeed work.
Tinkering with it a bit, I saw that it works without the previous patch, just by skipping the PatchSyscall call while keeping the SIGILL handling below it.
It does look like an important flow to skip entirely, though; what would be an effective way to tackle this?
@avagin - Do you think the issue is within the patch logic, or should we skip patching ptrace / sigchld altogether? I can prepare a patch, but I'm not yet sure what would make the most sense.
Hey @dany74q, you can just use the new flag --systrap-disable-syscall-patching for your use case. A general solution is possible if we roll the patches back to their original state, but it will have to come later unless you want to try your hand at it. If you do look into it, the simplest approach would be to stop the entire subprocess that triggered the need to roll back, rewrite the patched syscalls back to their original state, and then continue. Ideally, we'd also expose a per-stub-thread flag to prevent further patches on the stub side (though preventing patching from just the sentry is ok).
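For anyone landing here: applied to the reproducer above, the suggested workaround is just the extra flag on the run command (a sketch, assuming a runsc build recent enough to include the flag):

```shell
# Same run command as in the reproducer, with usertrap syscall patching disabled.
sudo -E env PATH=$PATH runsc --host-uds=all --systrap-disable-syscall-patching \
  run -bundle /tmp/gvisor-dlv/bundle test
```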