panda
Replaying a recording fails with assert error on g_main_context_acquire or segfault
Description
I am trying to get a basic record-or-replay mechanism running. Following the documentation, I believe this should just be a call to either `panda.run_replay` or `panda.record_cmd`. However, replaying previously recorded traces with (Py)PANDA seems to be nondeterministic. Recording works fine, but replaying results in either a segfault or an assertion error. This behavior is consistent across both i386 and x86_64 images of both Linux and Windows. Am I doing something wrong, or is this a bug?
How to reproduce
- Run a new PANDA instance from Docker Hub:

      $ docker run --rm -it -v $(pwd):/local pandare/panda /bin/bash

- Create a new basic `test.py` file in `/local` with the following contents:

      from pandare import Panda
      import os

      panda = Panda(generic='x86_64')

      @panda.queue_blocking
      def driver():
          recording_name = "linux64recording"

          # Does a recording exist? If so, replay it. Otherwise, make a new recording.
          if os.path.isfile(f"{recording_name}-rr-snp"):
              print("Replaying")
              panda.run_replay(recording_name)
          else:
              print("Recording")
              panda.revert_sync('root')
              panda.record_cmd("whoami", recording_name=recording_name)

          panda.stop_run()

      panda.run()

- Run the script once and observe normal execution with output similar to the following:

      root@5a4e2fa4f486:/local# python test.py
      using generic x86_64
      Downloading required file: https://www.dropbox.com/s/4avqfxqemd29i5j/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2?dl=1
      /root/.panda/bionic-server-cloud 100%[========================================================>] 2.84G 22.2MB/s in 2m 33s
      Validating file hash
      os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
      PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
      [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sf2yccvkw,server,nowait -monitor unix:/tmp/pypanda_mxmdsta9o,server,nowait]
      Replaying
      loading snapshot
      ... done.
      opening nondet log for read : ./linux64recording-rr-nondet.log
      ./linux64recording-rr-nondet.log: 3253417 instrs total.
      Replay completed successfully
      Exiting cpu_handle_execption loop

- Now run the script a second time, and notice it crashing. I have seen several variations of output for this, but the following two seem to be the most prevalent:

      root@5a4e2fa4f486:/local# python test.py
      using generic x86_64
      os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
      PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
      [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_sifghicd1,server,nowait -monitor unix:/tmp/pypanda_m8xy0ak1w,server,nowait]
      Replaying
      loading snapshot
      loading snapshot
      rdev-iowatch-serial0: Illegal RAM offset ffc000
      rdev-iowatch-serial0: error while loading state section id 4(ram)
      Failed to load vmstate
      Failed to start replay
      rdev-iowatch-serial0: Unknown savevm section 4
      Failed to load vmstate
      Failed to start replay
      free(): corrupted unsorted chunks
      Segmentation fault (core dumped)

      root@5a4e2fa4f486:/local# python test.py
      using generic x86_64
      os_name=[linux-64-ubuntu:4.15.0-72-generic-noaslr-nokaslr]
      PANDA[core]:os_familyno=2 bits=64 os_details=ubuntu:4.15.0-72-generic-noaslr-nokaslr
      [PYPANDA] Panda args: [/usr/local/lib/python3.8/dist-packages/pandare/data/x86_64-softmmu/libpanda-x86_64.so -L /usr/local/share/panda /root/.panda/bionic-server-cloudimg-amd64-noaslr-nokaslr.qcow2 -display none -m 1024 -serial unix:/tmp/pypanda_s079elggc,server,nowait -monitor unix:/tmp/pypanda_mdbfu5knm,server,nowait]
      Replaying
      loading snapshot
      loading snapshot
      ... done.
      opening nondet log for read : ./linux64recording-rr-nondet.log
      ... done.
      opening nondet log for read : ./linux64recording-rr-nondet.log
      ./linux64recording-rr-nondet.log: 3253417 instrs total.
      Replay completed successfully
      Exiting cpu_handle_execption loop
      **
      GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
      Bail out! GLib:ERROR:../../../glib/gmain.c:3375:g_main_context_acquire: assertion failed: (context->owner_count == 0)
      Aborted (core dumped)

Interestingly enough, the second output seems to indicate the replay completed successfully, but then still crashes afterwards.
Additional Info
- Host OS: MANJARO 23.1.0
- Kernel: 6.1.64
- Docker Version: 24.0.7
- `panda-system-x86_64 --version`: QEMU emulator version 2.9.1 (-dirty)
- CPU: Intel i7-11800H
- RAM: 16GB
I have a guess as to what's going wrong. When the recording exists, you call `run_replay`, which runs the replay. Then you still hit the call to `run`, which will try to resume emulation from wherever things were left after the replay, and that will be something weird. If that's the issue, I think we should raise an error instead of crashing in this strange way.
Does the issue go away if you refactor things to avoid the call to `run` after `run_replay`, like this:

    from pandare import Panda
    import os

    panda = Panda(generic='x86_64')
    recording_name = "linux64recording"

    if os.path.isfile(f"{recording_name}-rr-snp"):
        print("Replaying")
        panda.run_replay(recording_name)
    else:
        @panda.queue_blocking
        def driver():
            print("Recording")
            panda.revert_sync('root')
            panda.record_cmd("whoami", recording_name=recording_name)
            panda.stop_run()

        panda.run()

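The existence check in the scripts above only looks for the snapshot file. A slightly stricter variant is sketched below. It assumes that a recording consists of exactly the two files whose names appear in this thread (`<name>-rr-snp` and `<name>-rr-nondet.log`); the helper name `recording_exists` and its `directory` parameter are made up for illustration and are not part of pandare.

```python
import os


def recording_exists(recording_name, directory="."):
    """Return True only if both files of a recording are present.

    Assumption: a recording is made up of `<name>-rr-snp` and
    `<name>-rr-nondet.log`, the two file names seen in this thread.
    """
    required = (
        f"{recording_name}-rr-snp",
        f"{recording_name}-rr-nondet.log",
    )
    return all(os.path.isfile(os.path.join(directory, f)) for f in required)
```

With this helper, the condition `os.path.isfile(f"{recording_name}-rr-snp")` could be replaced by `recording_exists(recording_name)`, so a half-written recording would trigger a fresh recording instead of a replay attempt.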
Perfect, that seems to have fixed the problem!
This makes me a bit confused about the naming of the available functions, though. There are other `run_xxx` methods (such as `run_serial_cmd`) in the examples folder that are actually put into a function tagged with `panda.queue_blocking`. Is `run_replay` substantially different from these other run methods? When should something be put in a queue and when shouldn't it be queued?
Also, indeed, a more descriptive error would've been nice :).
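Until such an error exists upstream, a user-side guard could look roughly like the sketch below. It is not part of pandare: `GuardedPanda` and `ModeConflictError` are hypothetical names, and the only assumption about the wrapped object is that it exposes `run()` and `run_replay(name)` as used in this thread.

```python
import threading


class ModeConflictError(RuntimeError):
    """Hypothetical error type; not defined by pandare."""


class GuardedPanda:
    """Wrapper that allows either a live run or a replay, never both.

    `inner` is any object with run() and run_replay(name), for example a
    pandare.Panda instance. Other public attributes are forwarded as-is.
    """

    def __init__(self, inner):
        self._inner = inner
        self._mode = None
        self._lock = threading.Lock()

    def _claim(self, mode):
        # The lock matters because queued functions run on another thread.
        with self._lock:
            if self._mode not in (None, mode):
                raise ModeConflictError(
                    f"{mode} requested, but this instance was already "
                    f"used for {self._mode}; use one or the other"
                )
            self._mode = mode

    def run(self):
        self._claim("live guest")
        return self._inner.run()

    def run_replay(self, recording_name):
        self._claim("replay")
        return self._inner.run_replay(recording_name)

    def __getattr__(self, name):
        # Only called for attributes the wrapper itself does not define.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._inner, name)
```

Wrapping the object as `panda = GuardedPanda(Panda(generic='x86_64'))` would turn the mixed `run`/`run_replay` usage from the original script into a Python exception with a readable message, whichever of the two calls happens first.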
The `run` method runs a live guest while `run_replay` runs a replay. Generally you want one or the other, not both.
I definitely see what you're saying: it's confusing that the `run_serial_cmd` method has a similarly styled name while being totally different! Open to suggestions for how we could improve these names in the future!
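As a rough mental model for the queueing question: functions that drive a live guest are registered first and only executed once `run()` has started the guest, on a separate thread. The toy class below imitates that behavior with the standard library only. It is a sketch under that assumption, not PANDA's actual implementation, and `ToyEmulator` is a made-up name.

```python
import queue
import threading


class ToyEmulator:
    """Toy stand-in for a Panda object; NOT PANDA's real implementation."""

    def __init__(self):
        self._pending = queue.Queue()
        self._stop = threading.Event()
        self.log = []

    def queue_blocking(self, func):
        """Decorator: remember func so that run() can execute it later."""
        self._pending.put(func)
        return func

    def run_serial_cmd(self, cmd):
        """Pretend to type cmd into the live guest."""
        self.log.append(f"guest ran: {cmd}")

    def stop_run(self):
        """Ask the 'guest' to stop, which lets run() return."""
        self._stop.set()

    def _drain(self):
        # Runs on the worker thread: execute queued functions in order.
        while not self._pending.empty():
            self._pending.get()()

    def run(self):
        """Start the 'guest' and run the queued functions alongside it."""
        worker = threading.Thread(target=self._drain)
        worker.start()
        self._stop.wait()  # the main thread 'emulates' until stop_run()
        worker.join()
        self.log.append("run() returned")


emu = ToyEmulator()


@emu.queue_blocking
def driver():
    emu.run_serial_cmd("whoami")
    emu.stop_run()


emu.run()
```

In this model, anything that interacts with a running guest belongs in a queued function, while the call that starts execution (`run` here, and by analogy `run_replay`) is made directly from the main script.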
@AndrewFasano Thanks for the explanation, that definitely clears it up!
> I definitely see what you're saying: it's confusing that the `run_serial_cmd` method has a similarly styled name while being totally different! Open to suggestions for how we could improve these names in the future!

Some ideas come to mind:
- I would consider distinguishing between running a PANDA analysis and actually executing a program in the guest VM, perhaps using the `run_` and `exec_`/`execute_` prefixes.
- I would also consider renaming `run()` to something like `run_live_guest()`, to make it more explicit so that no confusion can arise.
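To get a feel for these names without changing the library, they could be tried out as thin aliases. The mixin below is only a sketch: `RenamedMethods` and `exec_serial_cmd` are hypothetical names (the latter derived from the `exec_` prefix idea), and the mixin simply delegates to the existing `run` and `run_serial_cmd` methods of whatever class it is combined with.

```python
class RenamedMethods:
    """Mixin exposing the proposed names as aliases of the existing ones."""

    def run_live_guest(self):
        # Proposed, more explicit name for run().
        return self.run()

    def exec_serial_cmd(self, cmd, **kwargs):
        # Hypothetical exec_-prefixed name for run_serial_cmd().
        return self.run_serial_cmd(cmd, **kwargs)
```

Assuming `Panda` can be subclassed in the usual way, something like `class MyPanda(RenamedMethods, Panda): pass` would make both spellings available side by side.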