Conditional breakpoints are very slow
After fixing issue #260 the conditional breakpoints do work, but they are very slow.
I think one reason may be that we use DebuggerCore::get_state/set_state, which perform many unneeded actions (filling AVX/FPU/etc.). These are called in both Debugger::handle_trap() and Debugger::resume_execution(). We'd be better off just fetching the instruction pointer (be it via PTRACE_PEEKUSER or whatever). (I'll try to look into this.)
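For concreteness, fetching just the instruction pointer amounts to a single PTRACE_PEEKUSER at the right offset into `struct user`. A minimal, x86-64-Linux-only sketch (the function names are illustrative, not edb's actual API):

```cpp
#include <csignal>
#include <cstddef>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

// Fetch only the instruction pointer of a stopped tracee with a single
// PTRACE_PEEKUSER call, instead of pulling the full GPR/FPU/AVX state.
long peek_instruction_pointer(pid_t pid) {
	return ptrace(PTRACE_PEEKUSER, pid,
	              offsetof(struct user, regs.rip), nullptr);
}

// Demo: spawn a child that stops itself, peek its IP, then let it exit.
long demo_peek_child_ip() {
	const pid_t pid = fork();
	if (pid == 0) {
		ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
		raise(SIGSTOP); // stop so the parent can inspect us
		_exit(0);
	}
	int status = 0;
	waitpid(pid, &status, 0); // wait for the child's SIGSTOP
	const long ip = peek_instruction_pointer(pid);
	ptrace(PTRACE_CONT, pid, nullptr, nullptr); // let the child exit
	waitpid(pid, &status, 0);
	return ip;
}
```

This skips the heavyweight PTRACE_GETREGS/GETFPREGS round trips entirely for the common "where am I stopped?" check.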
Another spot sysprof points to is MemoryRegions::sync(), which spends most time in PlatformProcess::regions(), which in turn spends most time in process_map_line(). I'm not sure whether this call to MemoryRegions::sync() is needed here (does breakpoint class need it?).
No, breakpoints don't specifically need it.
The memory region sync currently happens on every event because, unfortunately, UNIX-like systems have no intrinsic event for memory-map changes like Windows has. And since we never know what the handler will want to do on any given event, the conservative thing to do is sync once before handling each event.
Could it be faster to hash /proc/$PID/maps and compare hashes instead of full reparsing?
Possibly; it's worth trying. Could also switch to using a regex to parse the map instead of a long chain of ifs; dunno if that would be faster or slower though. Would need to benchmark.
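For concreteness, the hash-and-compare idea could look something like this (the class name and interface are made up for illustration, not edb's actual API); whether it actually wins depends on how the hashing cost compares to a full parse, so it would need benchmarking:

```cpp
#include <functional>
#include <string>

// Skip the full reparse when /proc/<PID>/maps contents have not changed
// since the last sync. Note: equal hashes don't strictly guarantee equal
// contents, but a collision here is vanishingly unlikely in practice.
class MapsChangeDetector {
public:
	// Returns true if `contents` differs from the last snapshot seen.
	bool changed(const std::string &contents) {
		const std::size_t h = std::hash<std::string>{}(contents);
		if (h == last_hash_) {
			return false;
		}
		last_hash_ = h;
		return true;
	}

private:
	std::size_t last_hash_ = 0;
};
```

MemoryRegions::sync() would read the maps file into a buffer, consult the detector, and only reparse on a change.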
On the State side, I've measured a performance improvement from 84 seconds to 62 seconds after applying this experimental commit.
For comparison, GDB hits this breakpoint in 22 seconds.
On another machine, with these changes and also commenting out the call to update_menu_state(RUNNING) in Debugger::resume_execution(), I got an improvement from 17 seconds down to 8! So this call seems to be the first target to fix.
The biggest problem appears to be the repaints triggered by this call. If I only comment out the call to update_menu_state on current master, the time still decreases by more than a factor of two (compare with the experimental commit above, which only improves things by 25%).
Very good finds, I'll check out how to apply some of these.
BTW, how are you timing this? I want to make sure that I'm doing comparable tests (please be as specific as possible)
Well, I didn't automate the timing. I just create the breakpoint after starting a special binary with the exact same sequence of bytes as in #260, then press F9 and count seconds on the wall clock until the breakpoint is hit. I'm sure you can automate it better, I just didn't try.
Fair enough. I wonder if we can do some rdtsc stuff in the app itself to see how long it thinks it took. Dunno if it'll get messed up by the bouncing back and forth between debugger and debuggee; we'll see.
rdtsc is unreliable, as it counts ticks, which can differ between cores whose clocks aren't synchronized. Better to use std::chrono::steady_clock for this.
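A minimal sketch of what timing with steady_clock could look like (the class name is illustrative):

```cpp
#include <chrono>

// Monotonic stopwatch: steady_clock never goes backwards and, unlike a
// raw rdtsc, is not thrown off by unsynchronized TSCs across cores or
// by frequency scaling.
class Stopwatch {
public:
	Stopwatch() : start_(std::chrono::steady_clock::now()) {}

	// Elapsed milliseconds since construction.
	long long elapsed_ms() const {
		using namespace std::chrono;
		return duration_cast<milliseconds>(steady_clock::now() - start_).count();
	}

private:
	std::chrono::steady_clock::time_point start_;
};
```

One instance constructed at resume and queried when the breakpoint hits would give the in-app numbers directly, independent of the debugger/debuggee bouncing.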
So (hitting the BP and skipping 0x5000 times):
with the update_menu_state(RUNNING) in: ~9000ms
without the update_menu_state(RUNNING) in: ~6000ms
A reduction of 30% is nothing to ignore, this indeed seems like the place to look.
Narrowing it down, it is specifically edb::v1::set_status that is occupying almost all of that 30%, looking into ways to make this better.
I did an experiment with boost hash; at least on my machine there was no measurable benefit. Since the code has to at least look at each line of the map in order to compute the hash, that work seems comparable to the work of just processing the line fully.
Also, I did an experiment with using a QRegExp and unfortunately, while it made the code a lot simpler, it was also about half the speed :-(.
So, please do a little testing and see if there is a measurable improvement for you with the master branch, and let me know.
Thanks
Your commit d437d0a improves timing from 73 seconds to 42. I.e. now it's 42% faster.
Applying my commit 1797fdd1d2905911c8ca7dae5cb6625162900072 on top of that, I get from 42 seconds down to 38 (not much improvement, but still repeatably measurable on my system). And after additionally applying a39e4c99550595691d765e8d18c9dc6a4912a331, it goes down to 26 seconds.
interesting, I think we can actually improve 1797fdd by using boost::hash_combine, instead of doing a hash of hashes.
I'll play with it some more. Perhaps Qt IO is much slower than iostreams? When I tested, I tried to do it with a slightly different approach than yours, I'll see what literally using your technique gives.
Thanks!
I've pushed 634d182d, which uses boost::hash_combine so it no longer needs to do heap allocations to get the total hash. On my system, it improved times by 50%!
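For reference, the mixing step that boost::hash_combine has classically used folds each new hash into a running seed in place (boost's exact formula has varied across versions), so no container of intermediate per-line hashes is needed. A self-contained sketch:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Fold `value` into `seed` in place (the classic boost::hash_combine mix,
// using the golden-ratio constant 0x9e3779b9).
inline void hash_combine(std::size_t &seed, std::size_t value) {
	seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Example: combine per-line hashes into a single digest. The result is
// order-sensitive, which is what we want for a maps file.
std::size_t combined_hash(const std::string &a, const std::string &b) {
	std::hash<std::string> h;
	std::size_t seed = 0;
	hash_combine(seed, h(a));
	hash_combine(seed, h(b));
	return seed;
}
```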
One thing to note is that my latest commit, which makes write_bytes work in terms of /proc/<PID>/mem, has a speed regression here (I'm fairly certain it's due to this). My plan is to rework the code such that for reads/writes smaller than sizeof(long), we use the older PTRACE_POKE/PEEK approach.
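For context, reading via /proc/<PID>/mem boils down to a single pread() at the target address, which is why it beats a loop of word-sized PTRACE_PEEKDATA calls for large transfers but can lose to them for one-byte accesses. A Linux-only sketch (the function name is illustrative):

```cpp
#include <cstdio>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

// Read `len` bytes at `addr` in process `pid` via /proc/<pid>/mem.
// One pread() transfers any amount of data in a single syscall, versus
// one ptrace syscall per word with PTRACE_PEEKDATA.
ssize_t read_process_memory(pid_t pid, unsigned long addr, void *out, size_t len) {
	char path[64];
	std::snprintf(path, sizeof(path), "/proc/%d/mem", static_cast<int>(pid));
	const int fd = open(path, O_RDONLY);
	if (fd == -1) {
		return -1;
	}
	const ssize_t n = pread(fd, out, len, static_cast<off_t>(addr));
	close(fd);
	return n;
}
```

A real implementation would keep the fd open across calls and fall back to PTRACE_PEEKDATA for sub-word accesses, per the plan above.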
I've added a fast path to the code that reads and writes bytes. On my system, this provides performance comparable to the code from 634d182. @10110111, if you get a chance, check how it performs for you and whether we still have a regression here. But please make sure to compile in release mode (I thought it was doing terribly until I realized that I was compiling in debug mode :-P).
Didn't see any measurable difference between ead0569012928ac188d477df0bc3b461385b51cc and 634d182d2692ca67374cfff4959ae7789a605a8e. I.e. no regression detected, all seems to be fine.
BTW, which is the fast path? I failed to spot the relevant change after your previous comment.
In bf8f1dea, there is some code to do quick and dirty read/writes when len == 1
Ah, I see the source of the confusion. When I mentioned 634d182, I was trying to refer to the last revision which I knew did not regress in speed. After that, I made some other changes which I suspected might have a regression, which I hopefully addressed in bf8f1de.
So the ideal test would be to compare 634d182 to bf8f1de :-)
Yeah, before bf8f1dea15bf050d699f2289a7b879bb3e79f0eb I measured 25 seconds, after it 22, same as at 634d182d2692ca67374cfff4959ae7789a605a8e. Thus, you've successfully fixed the regression.
Excellent to hear. So the question is, how fast do we need to make it in order to consider this bug closed? "Slow" is a subjective measurement. Perhaps we can compare against, say, GDB?
I'd like it to be as fast as we can make it, i.e. at least incorporate something like a39e4c99550595691d765e8d18c9dc6a4912a331. It improves timings from 22 seconds to 17 on my system, which is even faster than GDB (which takes 22 seconds). I'm not sure, though, that I compiled GDB in release mode, so you'd better test it too.
I admit a39e4c99550595691d765e8d18c9dc6a4912a331 will require some refactoring of DebuggerCore, though, so I'm not sure it's quick to implement correctly.
OK, just as a data point: for 0x10000 conditional breakpoint hits:
| Debugger | Duration |
|---|---|
| gdb | 26702ms |
| edb | 10630ms |
So we're doing great when comparing to the competition :-).
I agree that it would be nice to have a way to fetch just a small portion of the state, to make this faster.
Yea, let's think about a clean way to augment the API to support changes equivalent to a39e4c9. Once we've done that (and implemented it), I think it would be safe to consider the bug "fixed"
Agreed.
One thing I'm currently playing with is a cleaner API for IDebugger. The plan (only half implemented in the current code base) is to have IDebugger provide only the primitives for basic debugging operations, nothing process- or thread-specific. Then you call IDebugger::process() to get access to the attached process, where you can start/stop/read/write/etc.
Finally, next comes IProcess::threads() and IProcess::current() which will give you either a list of IThreads or a pointer to the current IThread, where you will have access to thread specific functions, such as start/stop/get_state/set_state/etc.
In IThread, we can have some handy functions such as IThread::instruction_pointer() to get the current IP of that thread, and such.
I think this is sufficiently abstract as to not limit us in the future, but also allows for a fairly efficient implementation.
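To make the shape concrete, here's a skeletal version of that hierarchy (only the class and method names mentioned above come from the actual plan; everything else, including the stub, is illustrative):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Thread-level operations; a narrow accessor like instruction_pointer()
// lets hot paths avoid a full get_state()/set_state() round trip.
class IThread {
public:
	virtual ~IThread() = default;
	virtual uint64_t instruction_pointer() const = 0;
	// start/stop/get_state/set_state/etc. would also live here.
};

// Process-level operations: thread enumeration, memory access, etc.
class IProcess {
public:
	virtual ~IProcess() = default;
	virtual std::vector<std::shared_ptr<IThread>> threads() const = 0;
	virtual std::shared_ptr<IThread> current() const = 0;
	// start/stop/read/write/etc. would also live here.
};

// The core exposes only basic debugging primitives plus access to the
// attached process.
class IDebugger {
public:
	virtual ~IDebugger() = default;
	virtual std::shared_ptr<IProcess> process() const = 0;
};

// Minimal stub showing how a platform backend would plug in.
class StubThread final : public IThread {
public:
	explicit StubThread(uint64_t ip) : ip_(ip) {}
	uint64_t instruction_pointer() const override { return ip_; }

private:
	uint64_t ip_;
};
```

The conditional-breakpoint hot path would then be a cheap `debugger->process()->current()->instruction_pointer()` chain rather than a full state fetch.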
Of course, I'm always open to alternatives if you have ideas as well :-)
Looks nice. Won't we eventually want to be able to query/set virtually all GPRs individually, not just instruction pointer and the like?
It's possible, but not necessarily required.
For things which happen at human speed, get_state/set_state is fast enough as is, no point in shaving milliseconds when the user is typing in a value to set.
So, we may be able to get away with having specialized functions for things which are truly universal (IP, SP, etc.), but with that in mind, I'd like to avoid having the abstract interface lock us into the x86 family if we can. The arch-specific code in DebuggerCore can always do a lot of the arch-specific stuff for the front end without exposing too much via the API.
I have some more thoughts on this, which would probably best be discussed via regular email though.
I was rather thinking of expression evaluation, not things happening at human speed. But well, this indeed seems to be a premature optimization. The evaluation could involve multiple registers in general, so I think this indeed shouldn't be much of a concern for now.