Update to modern QEMU!

Open AndrewFasano opened this issue 4 years ago • 48 comments

We've been talking about this for a bit but haven't started work on it. Creating this issue to track progress.

We're currently forked off of QEMU at version 2.9.1. We should update to 4+. At the time of writing, the latest version is 4.1.

We'll likely need to disable MTTCG to avoid significantly changing the record/replay model.

The main tasks I see for now are:

[ ] Actually do the git merge and handle merge conflicts to create a (likely broken) branch with all the commits
[ ] Get callbacks to run
[ ] Capture correct recordings
[ ] Replay recordings correctly
[ ] Library mode: build qemu as a library
[ ] Pypanda testing
[ ] Extensive testing

Mar 20 '20 14:03 AndrewFasano

Just to give it my 2 cents: I recently upgraded qira from qemu 2.x to qemu 4.0 instead of 4.1. At least for that effort, the changes after 4.0 were getting so severe that porting to 4.0 before going to 4.1 made sense. This may, of course, not be true for PANDA, but it may be easier to port PANDA to some intermediate version of qemu than going to 4.1 all at once.

Mar 23 '20 05:03 janbbeck

Judging by the amount of monkeypatching I needed to do to get PANDA as it currently exists to build on my system, I almost think it would be easier to collect what is different between PANDA and the QEMU version it's based on and just rewrite it against QEMU 4.

Mar 24 '20 19:03 hanetzer

@janbbeck Thanks for the tip. 4.0 sounds like it might be a more realistic target.

@hanetzer It's unlikely that any of us have time to do a full rewrite given that there are 2k commits to PANDA since we forked. If we had time, that would certainly be the cleanest way. Also, you shouldn't need to do any monkeypatching. If you've had build issues, can you open an issue? Our CI has been able to build PANDA on clean machines without any problems.

Mar 24 '20 19:03 AndrewFasano

@AndrewFasano I'm running on gentoo ~amd64 (read: bleeding on the edge), so there's a lot of new gcc diagnostics and such that make the build fail in creative ways.

While I have a human looking at this, I'll open an issue about what I'm trying to do and what is/isn't/idk working.

Mar 24 '20 19:03 hanetzer

@hanetzer Ah, okay, that makes more sense then. We just fixed some gcc7-related errors in the last few weeks but that's still probably not new enough for you.

Mar 24 '20 20:03 AndrewFasano

@AndrewFasano yeah. I'm currently working under a deadline and this is the last tool I can think of to help me do what I want, but once I either fail the task or succeed I'm more than willing to help put in the work on the update. Think you could eyeball #577 and toss me some suggestions?

Mar 24 '20 20:03 hanetzer

FWIW, PANDA builds just fine for me with gcc 7, 8, and 9 under Ubuntu 19.04, as long as werror is suppressed. No monkeying necessary, using this type of configure:

./configure --target-list=x86_64-softmmu,i386-softmmu,arm-softmmu,ppc-softmmu --prefix=/home/jan/Downloads/panda/panda/scripts/panda/build/install --python=/usr/bin/python2 --disable-vhost-net --extra-cflags=-DXC_WANT_COMPAT_DEVICEMODEL_API --extra-cflags=-DOSI_PROC_EVENTS --extra-cflags=-DOSI_MAX_PROC=256 --extra-cflags=-DOSI_LINUX_PSDEBUG --extra-cflags=-Wformat-truncation=0 --disable-werror --cc=gcc-9 --cxx=g++-9 --host-cc=gcc-9

Just change it to the compiler of your choice.

I just rebuilt panda with gcc 7, 8 and 9 and booted each time into a ubuntu live dvd and noticed no issues. HTH
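Swapping the compiler in that command can be scripted. The following is a minimal sketch, not part of PANDA: the helper name `configure_args` is hypothetical, it reuses only flags from the command above, and it leaves out the machine-specific `--prefix` and the OSI-related `--extra-cflags` for brevity.

```python
def configure_args(gcc_version, targets):
    """Build a ./configure argument list for one gcc version.

    Only flags quoted in the thread are used; the prefix and OSI defines
    from the original command are intentionally left out.
    """
    return [
        "./configure",
        "--target-list=" + ",".join(targets),
        "--python=/usr/bin/python2",
        "--disable-vhost-net",
        "--extra-cflags=-Wformat-truncation=0",
        "--disable-werror",
        f"--cc=gcc-{gcc_version}",
        f"--cxx=g++-{gcc_version}",
        f"--host-cc=gcc-{gcc_version}",
    ]


# Print one command line per compiler version mentioned in the thread.
for version in (7, 8, 9):
    print(" ".join(configure_args(version, ["x86_64-softmmu", "i386-softmmu"])))
```

Each printed line can be pasted into a shell from the PANDA build directory; the target list here is shortened for illustration.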

Mar 25 '20 12:03 janbbeck

Bumping versions is hard, but not impossible. About a year ago, I worked on bringing PANDA up to 2.9.1 from an in-development version of QEMU 2.9. That itself was an undertaking, but mostly because I had to figure out what MTTCG changes were made and favor the PANDA version.

I would recommend bumping versions incrementally. I know you really want QEMU 4 (or maybe even now, QEMU 5) but I suspect this won't be easy.

I'd follow this approach, as it's what I did to bring PANDA to 2.9.1.

Pick a version, recommend 2.10.2.
Make a list of commits that need to be merged in.
Merge and bisect over the commits to be merged to see where stuff breaks.
Test, etc.
Rinse and repeat for 2.11.2, ... 5.0, etc.

While this is somewhat tedious there's probably a fair amount of automation you could do and you'll be in a working state at each step of the way and avoid huge merge conflicts.
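As one example of that automation, the sequence of intermediate versions can be computed from a list of release tags. This is a rough sketch under stated assumptions: the function names are hypothetical, the tag list is illustrative (in practice it would come from `git tag` in a QEMU checkout), and release-candidate tags are assumed to be filtered out beforehand.

```python
import re


def version_key(tag):
    """Turn a tag such as 'v2.10.2' into a sortable tuple (2, 10, 2)."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def merge_plan(tags, current, target):
    """Return the newest point release of each minor series in (current, target]."""
    newest = {}
    for tag in tags:
        key = version_key(tag)
        if version_key(current) < key <= version_key(target):
            series = key[:2]  # e.g. (2, 10) for every v2.10.x tag
            if series not in newest or key > version_key(newest[series]):
                newest[series] = tag
    return [newest[series] for series in sorted(newest)]


# Illustrative tag list only; rc tags are assumed to be removed already.
tags = ["v2.9.1", "v2.10.0", "v2.10.2", "v2.11.0", "v2.11.2", "v2.12.1", "v3.0.0"]
print(merge_plan(tags, "v2.9.1", "v2.11.2"))  # ['v2.10.2', 'v2.11.2']
```

Targeting the last point release of each series keeps every stop on a version that upstream considered stable.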

Mar 25 '20 17:03 nathanjackson

I would like to point out that one of the major reasons I stopped at 4.0.0 for the qira port is that after that they added a plugin architecture of some sort that greatly changed the source files. It sure appeared tailored for the sort of thing qira and PANDA do. It looked a lot to me like the correct thing to do from that point forward is to redo qira as a plugin, and that was more than I was willing to bite off :p I am bringing this up because maybe, just maybe, the QEMU plugin stuff is actually the way PANDA should be done in newer versions of QEMU. If that is the case, I am sure there will be no end of discussion about whether to port piecemeal or just bite the bullet and make a plugin.

But I think it's worth a look before deciding to go piece by piece to 4.0.0 and beyond.

Thoughts?

Mar 25 '20 17:03 janbbeck

Here is more info: https://github.com/qemu/qemu/blob/master/docs/devel/tcg-plugins.rst

Mar 25 '20 17:03 janbbeck

I've looked at the tcg plugins a little and I'm really excited that qemu is finally starting to support using it for analysis, but I think there's still a long way to go before it could support everything we use/provide with panda. The biggest issue I see for now is that the TCG plugins only support passive observation so you can't modify a system during its execution.

Mar 25 '20 17:03 AndrewFasano

That is good to know! I suspect the code changes in qemu to support the plugins make porting panda painful....

Mar 25 '20 17:03 janbbeck

I guess most of our plugins do passive analysis, but a lot of my research involves mutating guest state so I'm probably a bit biased when I consider that shortcoming to be a dealbreaker.

Mar 25 '20 17:03 AndrewFasano

Is there some particular shortcoming that necessitates a move away from the current qemu base?

Mar 25 '20 17:03 janbbeck

@mariusmue first brought it up so he might want to chime in. I mainly want support for more machine types and whatever stability/performance fixes they've made.

Mar 25 '20 17:03 AndrewFasano

Hi all, in my opinion, an update to a newer version of QEMU foremost benefits emulation and analysis of non-x86-based systems. As @AndrewFasano mentioned, there are simply more machines and architectures implemented. Furthermore, when it comes to avatar-related changes, being on a modern version of QEMU would make it easier to sync changes between the two frameworks, but this is just a minor point.

MTTCG is luckily guarded well behind preprocessor macros, so it should be easy to keep it deactivated. Furthermore, I think sometime around QEMU 3.0 there was a tremendous change/cleanup in the QAPI; I personally like the new organization better.

When it comes to tcg-plugins, yes, they are great for passive monitoring of VMState. In case PANDA wants to enforce a strict separation between record and analysis, I would suggest that recording be re-implemented on top of this API (if possible), as this would allow recording on stock builds of QEMU as a drop-in solution. In theory, this should even allow for recordings independent of the QEMU version, but I think various problems may arise in practice (e.g., different peripheral implementations). In any case, the actual analysis/replay instances of PANDA would still need to hook in at various places in the codebase, but distinguishing whether we are in record or replay mode should become an artifact of the past, allowing for a cleaner codebase that is also easier to migrate to upcoming versions of QEMU.

Hence, if these changes are going to happen, I would not plead for a complete rewrite of PANDA, but for identifying and minimizing the locations PANDA actually hooks into QEMU's core logic.

Mar 26 '20 10:03 mariusmue

I'm merging QEMU 5.0 rc1 into PANDA as part of my thesis. I hope that going from rc1 to the release version won't be a big problem.

I jumped right into it (I do question myself if that was the smart way) and I have the conflicts down to cpus.c and softmmu_template.h, and whatever errors the compiler will throw at me when I first build it. There were a few parts that were drastically changed, mostly around the TCG.

I'm waiting on the reply from my supervisor about how and when we can publish the source, but I guess I'll know more next week and by then I think I'll also have the merge commit.

Apr 10 '20 20:04 glueckself

Hey @glueckself, that's great to hear! After you finish the merge, I'd expect there to be lots of bugs (unless you're really good at merging) as PANDA usually needs additional updates when core QEMU things change. Once you have a merge commit, I think there are a number of us who would be willing to help track down bugs if you're able to share!

Apr 10 '20 20:04 AndrewFasano

Some thoughts about the (welcome) migration to the QEMU 4 codebase.

Since PANDA will continue following the development of QEMU, maybe it would be helpful to label PANDA releases based on the underlying QEMU codebase? E.g., the current version would be PANDA2 (based on the QEMU 2.x codebase). The next version would be PANDA4 (based on QEMU 4.x). This would make it a bit easier to discuss issues while both versions are in use.

I can see this issue growing longer and longer. Maybe a separate branch should be created while stabilizing/working out bugs with the QEMU4-based panda? Then issues related to the branch can be reported individually. A new QEMU4 or PANDA4 label for the issue tracker can be used to filter issues quickly.

Finally, it would probably be good to also announce a draft time-plan for the deprecation of the current code-base. This would encourage the community to migrate any incompatible code to the new version.

Apr 11 '20 17:04 m000

If my supervisor is ok with me publishing the code on Github, I can fork this repository. The fork would then provide a separate issue tracker. When the migration is completed, you could merge my fork back here. Please note that I'm migrating PANDA to Qemu 5.

Apr 11 '20 18:04 glueckself

@m000 I like your suggestions - then we get to jump from PANDA2 to PANDA4 and skip all the work of PANDA3 ;) I think we should wait until we have the new version at least partly working before we plan any deprecation timelines.

I think we'll take a look at @glueckself's code if/when that's available and then pull it into a branch on this repo, instead of tracking it in a separate repo. QEMU 5 sounds great if you're able to get it to work. Then we can go right up to PANDA5 :)

Apr 11 '20 18:04 AndrewFasano

I've now uploaded the branch containing the merge, however, I'm still fixing compiler errors. I have never merged anything this big before, so I expect there will be some mistakes in there. I would appreciate any feedback. :) I also haven't looked into testing of Qemu and PANDA yet, I think that'll be the next step after getting it to compile.

Regarding the checklist: my goal is to get record/replay running (and probably only for i386 and arm). After that, I'll probably have to switch over to my thesis (btw, it's only a BSc thesis).

Apr 13 '20 22:04 glueckself

There will be a public regression testing framework available soon.

I just looked at your branch: 23610 commits ahead, 654 commits behind panda-re:master.... Godspeed.

Apr 13 '20 22:04 nathanjackson

I've got qemu-system-i386 to build, however, some of the "fixes" I made are... a bit ugly (especially ef6a13e1f81cf029fc2e953c2406622717157b45). It manages to boot PC-MOS/386. And, with a workaround, Linux. But only a test image. Debian hangs on some udev soft lockups.

Issues I've discovered so far:

It's very slow. I've tried to look into it, but I couldn't find anything.
The main thread has 100% CPU load (might be what causes it to be slow).
The cdrom device times out. This causes Linux to fail to boot. The workaround is to start qemu with -nodefaults -vga std (I'm not sure if there is a way to remove only the cdrom drive). Haven't looked into it yet.
Soft lock ups in the guest.
It crashes when starting with -llvm or on begin_record.
The qemu tests/qtest/boot-serial-test hangs.

I've marked some places where I don't understand what's going on (or what should be going on) with "//TODO: panda:". I'll try to sort out as many as I can. Also, there are a lot of warnings. I haven't looked into them yet, probably there are bad ones in there.

My next goal is to get everything to compile and clean up the warnings and TODOs.

I have problems with the C/C++ mixing. QEMU started to use some C-specific stuff (e.g. __builtin_types_compatible_p() in include/qemu/atomic.h) and g++ doesn't support that. I'm not sure if C linkage can solve that (i.e. if there is an extern "C" missing somewhere) or if that has to be implemented for C++. Would it be possible for someone to support me there?

UPDATE: Now everything builds. However, there is also one more commit of questionable quality. Also, the include dependencies are not set up properly so that the make must be run with -j at least twice to make use of a race condition to create plog.pb.h.

Apr 17 '20 12:04 glueckself

@glueckself I would highly recommend incrementally merging, even within a QEMU version. I would try getting PANDA to the next released version of QEMU first (2.10.X I think). Tracking down these issues will not be easy, certainly not something where someone on the internet can just point you in the right direction.

Apr 17 '20 17:04 nathanjackson

@nathanjackson do you think that I'm on a dead end here? Because I've got it to build everything by now and my next step would be to clean stuff up. I guess most of the issues are somewhere near a "//TODO: panda:" comment and some are pointed to by compiler warnings that I've ignored for now. I don't think I would resolve most of those conflicts any better if they came one by one instead of all at once. To be honest, I would hate to have to throw everything away (however, I do prefer to throw it away instead of looking for bugs for the next two years).

Apr 17 '20 17:04 glueckself

I'm hesitant to say you're on a "dead end", but the issues you've listed would indicate you're running into complicated CPU loop/iothread issues. Where you start debugging is anyone's guess and it doesn't sound like you've even started looking at record/replay. Debugging this will be extremely difficult because QEMU is complicated and the code is multithreaded.

Since you're merging in a huge number of commits, how will you know which ones are causing problem(s) if you don't incrementally merge? How do you know your conflict resolution was correct for one issue while debugging another? Bisecting won't work well here because you've probably introduced more than one problem into the code. You need to do the opposite: introduce some commits -> test, then rinse and repeat. There are a large number of commits that should apply without you having to resolve conflicts, I promise this isn't as bad as it sounds.
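The "introduce some commits -> test" loop can be sketched roughly as follows. This is only an illustration: `merge_incrementally`, `apply_batch`, and `run_tests` are hypothetical names, and in a real setup the two callables would wrap the actual merge and the PANDA test run rather than the toy stand-ins used here.

```python
def merge_incrementally(batches, apply_batch, run_tests):
    """Apply batches in order and stop at the first one whose tests fail.

    Returns (number of batches merged cleanly, failing batch or None).
    """
    for index, batch in enumerate(batches):
        apply_batch(batch)
        if not run_tests():
            # Every earlier batch passed, so the breakage is in this one.
            return index, batch
    return len(batches), None


# Toy stand-ins: the batch containing "c4" breaks the build.
applied = []
batches = [["c1", "c2"], ["c3"], ["c4", "c5"], ["c6"]]
result = merge_incrementally(
    batches,
    apply_batch=applied.extend,
    run_tests=lambda: "c4" not in applied,
)
print(result)  # (2, ['c4', 'c5'])
```

Because every earlier batch was tested before moving on, a failure narrows the search to the commits in a single batch instead of the whole merge.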

If you try the incremental route, I don't think you have to throw everything away per se. By attempting the big merge, you've already learned some things about the code. You'll be able to use that knowledge as you're incrementally merging.

I don't want to discourage you from doing this, I think everyone including me wants PANDA to be on a newer QEMU. I'm just saying how I would go about doing this. If you think that you can make it work with one big merge, by all means do it.

Apr 17 '20 18:04 nathanjackson

Do you think it makes sense for me to try and resolve the places I've marked and maybe dig deeper into one or two of the issues? Maybe over the next two weeks or so, and if after that it's still a huge mess, then I could try the incremental route?

But you're right, bisecting becomes an issue then. Would such a merge be acceptable for maintaining anyway?

No worries, I'm thankful for the feedback :) I've never done anything like this before, so I expect a set-back or two :)

Apr 17 '20 18:04 glueckself

> Do you think it makes sense for me to try and resolve the places I've marked and maybe dig deeper into one or two of the issues? Maybe over the next two weeks or so, and if after that it's still a huge mess, then I could try the incremental route?

Honestly, no. I think any one of those issues you mentioned is probably more than two weeks of work if you were doing this full-time. You mentioned a bachelor's thesis. My guess is that you have other work to do as well. I think an incremental merge will get you almost immediate results.

> But you're right, bisecting becomes an issue then. Would such a merge be acceptable for maintaining anyway?

Whether we do this incrementally or in one commit, I think bisecting will be hard for a while anyway because of the changes being made. I wouldn't let that stop you in either case. What I was really saying is that, during the merge, figuring out which upstream QEMU change broke something would be next to impossible with one giant merge.

Apr 17 '20 19:04 nathanjackson

@glueckself FWIW, I could not agree more with Nathan about everything he has told you.

Apr 18 '20 03:04 janbbeck
