pyOCD icon indicating copy to clipboard operation
pyOCD copied to clipboard

RTOS: Make vector catches show originating thread

Open kjbracey opened this issue 5 years ago • 32 comments

This follows discussions with Alan Hayward, and research into the issue of showing HardFaults etc decently.

There has been discussion and attempts to patch GDB to let it unwind from the Main stack to the Process stack, so that when stopping inside a HardFault, we could see where the fault had come from. Discussion and links here:

https://bugs.launchpad.net/gcc-arm-embedded/+bug/1566054

After thought and discussion with Alan, I've concluded that that approach is not conceptually correct - it would better suit GDB's target process model to present the Handler Mode as a distinct thread from the Processes, and ensure they accurately show the full system state. This patch improves the RTX5 Thread provider to achieve this.

PyOCD's RTOS awareness already effectively achieves this, if you have a supported RTOS, and my previous patches have helped ensure that the view of the current thread is as up-to-date as possible when inside an exception.

In future it would make sense to have a "bare metal RTOS" provider for pyOCD to present Process and Main stacks as 2 threads, so the same effect is achieved without an actual RTOS.

This patch is an extra refinement step, which further special-cases the ARM vector catch stop when you have separate Process and Handler threads.

Before the patch, when stopping due to a HardFault vector catch, you would see:

   Thread #5 stopped due to SIGSEGV at HardFault_Handler
   #2 main thread (Running), PC at exception site
   #3 timer thread (Wait[MsgQueue])
   #4 idle thread (Ready)
   #5 Hander Mode thread (HardFault), PC at HardFault_Handler [GDB current]

This patch now gives you a more GDB-conventional view:

   Thread #2 stopped due to SIGSEGV at fault location
   #2 main thread (Running), PC at exception site [GDB current]
   #3 timer thread (Wait[MsgQueue])
   #4 idle thread (Ready)
   #5 Hander Mode thread (HardFault), PC at HardFault_Handler

This view better follows documented GDB behaviour that the "signal" is reported at the site it occurred. The conventional behaviour is that the system stops with PC showing the thread that received the signal, pointing at the instruction we would execute next in the foreground - for a SIGSEGV this would be the exception-generating instruction itself, which would be retried if the signal handler (or ARM exception handler) returned.

The ARM system state isn't modified - at the moment of stop the Handler Mode thread has come into existence, and its PC still shows the first instruction of HardFault_Handler.

On subsequent steps, we will step into the handler, and the Handler Mode thread will be current, just as on a Unix system we might step into the signal handler (albeit in the same thread, rather than a separate Handler thread).

This patch only refines behaviour when an exception arises from the Process stack. If the exception actually happens Handler mode, or in Thread mode using the Main stack you will still see "SIGSEGV at HardFault_Handler". But at least you will be in the correct thread - the site will be 2 frames up the backtrace.

kjbracey avatar Oct 19 '18 12:10 kjbracey

This is very much a first draft - I will continue to experiment with the concept. Thoughts welcome.

Main thing I now note using it is that the typical null pointer write exception in mbed OS is an external bus fault, so this doesn't work that well without a

SCnSCB->ACTLR |=  SCnSCB_ACTLR_DISDEFWBUF_Msk;

to make such faults be synchronous.

kjbracey avatar Oct 19 '18 13:10 kjbracey

FYI @SeppoTakalo

kjbracey avatar Oct 19 '18 13:10 kjbracey

Hi @kjbracey-arm, thanks very much!

I'm actually a little surprised that before this patch, when stopping on a HardFault vector catch the RTX5 thread provider didn't show the PC at HardFault_Handler(), because the Argon thread provider does show it that way (if I understand correctly).

This weekend I'll play with this PR and compare it with the Argon thread provider's behaviour. I definitely want to keep behaviour consistent between all the thread providers. (Ultimately that will mean refactoring more functionality to be shared in the provider base class.)

And right, I pretty much always keep the write buffer disabled in debug builds so I get precise bus faults. It might be a useful feature to provide a user option for pyOCD to disable the write buffer (logged a todo).

flit avatar Oct 19 '18 20:10 flit

This is what I see when stopping at a HardFault with Argon: argon_hardfault_eclipse

This is showing gdb continuing the backtrace in Handler mode even though the EXC_RETURN value indicates return should be to Thread mode on the Process stack.

I do think gdb should be able to correctly unwind across exception frames, simply to match the way Cortex-M exception stacking works.

flit avatar Oct 21 '18 19:10 flit

the RTX5 thread provider didn't show the PC at HardFault_Handler(),

It did, this is about showing where that fault came from - that's what we really need to know, and we generally struggle to figure out.

I do think gdb should be able to correctly unwind across exception frames, simply to match the way Cortex-M exception stacking works.

That was the subject of the discussion with Alan. It is what I'd always assumed was needed, prior to having the RTX5 thread awareness, and it's part of what the patches I linked to were looking at. I started off intending to polish them off to do it.

But conceptually it's not anything GDB has ever done on any target, as far as I can tell. GDB cannot unwind exception frames that change processor mode on any platform.

GDB debugs processes, each of which have a single stack, using core registers (r0-r15). Making the leap from Main Stack+Handler to Process Stack+Thread mode by interpreting system registers is kind of out of its scope. Certainly possible - stitching two stacks together to view as a single process - but is it conceptually right? Should ARM-M be a special case? The fact it would be special, and add system registers to GDB for the first time, was part of the reason the patches stalled.

My current view is no. If a Cortex-M system is bare-metal enough to just be using the Main stack, then it's reasonable for GDB to be able to bare metal unwind the Main stack, which it can do with just core registers, in the absence of Thread awareness.

But if a system is using Process stack(s), then that implies an OS, which implies you could get the Thread awareness to take care of showing the Process stack(s) and Main stack(s) alongside each other.

As a minimum, you could arrange for the vanilla unknown-OS pyOCD thread provider to offer the Process stack from PSP whenever in handlers state (if PSP is non-0?). So GDB can see 2 threads.

This patch is kind of independent of whatever happens at the GDB end - it would work regardless of whether GDB stitched stacks together. But it would make sense to agree the principle of how stuff should be presented. If the assumption was that PSP is displayed as different thread(s), then it would be neatest for GDB to terminate unwinding as it sees a Handler returning to Process stack, on the assumption that the Process stack will be visible elsewhere.

So, behaviour of GDB when unwinding Handler->Process exception frame:

  • Current: keeps unwinding down Handler stack, showing pointless/incorrect info like OS entry SVC
  • Proposed by previous patches, and my previous view: jump over to current PSP
  • My current view: terminate unwinding, expect current (and alternate) PSPs to be shown as different thread(s)

kjbracey avatar Oct 22 '18 06:10 kjbracey

Rebased and split this up into as small a chunks as I can, and transferred most of the relevant changes I made to RTX5 into other RTOS providers.

One of the key improvements - the "actual current thread" was actually already present in the other providers - it appears RTX5 was deficient, so I've brought it into line.

Most of these commits are now straight bug fixes or innocuous refactorings.

Some then improve behaviour slightly - eg unavailable registers.

The last commit is then the key behaviour-changing one, which holds off from flipping "current thread" until after the vector catch stop.

kjbracey avatar Nov 02 '18 14:11 kjbracey

Thanks for the updated commits. I'll try this PR out while I'm working on a project this weekend.

flit avatar Nov 02 '18 21:11 flit

/morph test

flit avatar Nov 02 '18 21:11 flit

@MaureenHelm Can you review this PR to make sure you're happy with the changes to the Zephyr ThreadProvider? Thanks!

flit avatar Nov 11 '18 20:11 flit

If accepting as conceptually right, the last "rework vector catch" commit should also be extended to cover other providers - the "get live LR" one may also be relevant for some.

As the next step I'm currently looking at the "bare metal" version equivalent version of this - a thread provider with 2 threads for PSP and MSP. Could possibly act as an increased amount of common code for the RTOS providers, as well as eliminate the "no thread provider" special cases from the core.

kjbracey avatar Nov 12 '18 08:11 kjbracey

Excellent! I'm looking forward to the PSP/MSP provider.

flit avatar Nov 12 '18 23:11 flit

To be honest, I'm not sure how I feel about showing the faulting thread as current when the actual PC is at the HardFault_Handler. When break in the debugger through whatever means, I expect (or at least, I'm very much used to) the debugger showing the actual PC.

To me, the most important part of this PR is fixing up the running thread view. Showing the HardFault_Handler as active when breaking tells the user that something broke. They can then look at the running thread for the faulting instruction.

If you drop the last commit, I can go ahead and merge this. Then we can continue the discussion 😄

(And apologies it took me so long to fully digest this! I just completely overlooked the bit about the reported current thread until just now.)

flit avatar Nov 13 '18 00:11 flit

I expect (or at least, I'm very much used to) the debugger showing the actual PC.)

Yes, that's certainly what we're used to in "bare metal". Very much not what happens once you start putting stuff on top of the bare metal, and I've come around to the view that it is just expectation.

GDB has a high-level thread view - it was never designed for bare metal - and we have scope for a certain amount of interpretation in presenting that, and what we do now is a "literal to the hardware" interpretation - excessively so in that it's somewhat off as a standard GDB thread/signal view.

GDB documentation is very clear that the expectation is that the debugger stops showing the thread that the "signal" was delivered to. "SIGSEGV in thread X at YYY" is the truth, or at least as close as we can get to it for GDB's presentation. "SIGSEGV in thread MainStack at HardFault_Handler" is very much not it, unless the first instruction of HardFault_Handler has trapped itself.

Certainly this is a compromise - the PC is really in the MainStack/HandlerMode thread at the instant of stop. But then in Unix at the instant of stop the execution path is already "going in" to the signal handler - if you step from SIGSEGV it won't execute the instruction the visible PC is pointing at, it will head into the pending signal handler. So in both cases PC points at where we /were/, it's not what we will execute next. (But it's what we'd execute next if we returned from the signal handler)

(And with this patch, the HandlerMode thread with PC pointing at first instruction is still visible when the SIGSEGV happens, and any attempt to step will flip the view immediately to the HandlerMode thread if you don't switch manually.)

Some older ARMs (or at least the debugger I used way back) had the decency to actually stop on the exception-causing instruction /without/ entering the handler, which was ideal. No compromise needed on the view. Although it did mean then you got a bit stuck if you wanted to continue because a step would just retrigger the fault. Hard to either get past the instruction in the thread or into the handler. Would need to disable the catch temporarily and set a breakpoint in the handler. (This would have been an ARM7TDMI or ARM920T, I guess).

Anyway, sure, I can separate it. I'll put the fixes into a separate PR, so this can stay discussion of the final commit. (When I started this, I never realised how close the RTX5 view was to what I wanted, and it mainly just needed some exception-handling fixes).

kjbracey avatar Nov 13 '18 09:11 kjbracey

Rebased, and extended to all RTOS providers. Session option added to control behaviour.

Selection of HandlerModeThread as current when IPSR > 0 is now performed up in GDBServer, so is not required to be fitted to every ThreadProvider, and that centralises the session option logic. A logical next step would be to eliminate all HandlerModeThread references from the thread providers, and have it centrally added to the OS-derived list they supply.

If we wanted this to be more general a possible structuring would to separate RTOS ThreadProviders from "core" ThreadProviders. RTOS ones would provide all the banked Process stacks, and "core" ones would provide any extra handlers (just 1 for ARM-M, but IRQ/SVC/FIQ... for ARM-A).

kjbracey avatar Nov 15 '18 13:11 kjbracey

Okay, my last couple of edits broke something, so don't bother testing - will need to fix this up tomorrow. Feel free to comment though.

kjbracey avatar Nov 15 '18 14:11 kjbracey

Thanks again, especially for the session option!

flit avatar Nov 15 '18 20:11 flit

Fixed - seems to be working okay now.

kjbracey avatar Nov 16 '18 14:11 kjbracey

/morph test

flit avatar Nov 21 '18 20:11 flit

Showing the context in which a hardfault occurred rather than the hardfault ISR

I like the idea of this, but I don't like the fact that what the debugger is showing you doesn't match reality. My expectation for being in the context the fault occurred in is I can manipulate the PC and registers to replay the failure. This is not the case, as moving the PC back is either meaningless (As the thread won't execute again) or a problem if the PC from the fault context is moved without returning from the exception.

c1728p9 avatar Nov 26 '18 17:11 c1728p9

Showing the context in which a hardfault occurred rather than the hardfault ISR

My expectation for being in the context the fault occurred in is I can manipulate the PC and registers to replay the failure.

99% of C developers are not capable of doing this. For most, the PC is too abstract, and that is why debuggers like GDB have been invented for them. Also, you seem to assume knowledge of Cortex-M cores for embedded developers. I have seen quite opposite.

Therefore I very much like the debugger showing where the HardFault originated from, and if this means that register values don't match anymore, it is a limitation that I can happily live with. The stuff that I'm interested with have so far been the local variables, and parameters on where the hardfault happened.

SeppoTakalo avatar Nov 27 '18 10:11 SeppoTakalo

My expectation...

This PR does have a session option to retain current catch behaviour, so there is scope for taste/expectations. I guess we're just debating the default.

the debugger is showing you doesn't match reality.

I did consider and discard some other approaches because they would have (IMO) violated that rule. This one was close enough to make me comfortable.

My expectation for being in the context the fault occurred in is I can manipulate the PC and registers to replay the failure.

This change doesn't affect our ability to do that. (Or at least, I don't believe it does.) The only change is which thread is selected in GDB at the instant of vector catch stop, as far as I'm aware. Nothing else about the presentation is modified. If you just click on the "Handler" thread (or use the "thread" command to select it), you're back to where you were without this change.

Note: as far as I can see currently all writes to a thread's registers in pyOCD are treated as writes to live registers, so it only really behaves sensibly when GDB's selected thread is the real current one. If this were improved so that writes to non-live threads modified the parked OS state or exception return frame, I still don't think this change would affect the situation.

I agree with Seppo that wanting to manipulate the "raw" stack frame and registers, rather than looking at the variables of the faulting thread, is going to be 1% of cases, which is why I think the default should be the new form.

You're possibly operating at one extreme here where I suspect you'd like each GDB "thread" to be mapped raw to 1 CPU. Just see live registers, with no interpretation. That's the purest view, and you could extend it to SMP - 4 threads for 4 cores, so 4 sets of live registers. No danger of confusion.

But with all these pyOCD non-SMP RTOS (or Process+Main stack) approaches, there is still always exactly 1 "live" thread which is working like that, showing the live registers that you can also write to, while the others are "parked". GDB doesn't display the current thread directly, but you can infer it from which pops up when you stop. This change does mean there's an extra inference required - when it's a vector catch the Handler thread appears but is not immediately selected. I think that's an extra inference anyone capable of manually manipulating an exception unwind is also capable of.

(You can infer that the live CPU is really stopped at the first instruction of the handler by the fact that it was a SIGSEGV, and that the Handler thread exists at all - it doesn't if not in a handler).

kjbracey avatar Nov 27 '18 12:11 kjbracey

Rebased, no changes. But note that #453 ends up reworking the implementation significantly (and changes the option name to be generic).

kjbracey avatar Dec 13 '18 16:12 kjbracey

Rebased and retested.

kjbracey avatar Feb 28 '19 13:02 kjbracey

Thanks for rebasing! I'll merge this if you make the default for vector_catch_show_process as False. Sam Grove does not want this to be the default view (sorry).

flit avatar Feb 28 '19 20:02 flit

Missed this comment, or I would have discussed it with him when he was visiting last week...

kjbracey avatar Mar 06 '19 14:03 kjbracey

If we do want to take this without #453, I'll probably want to pull in at least the edit to the option name, to keep the path open to #453.

kjbracey avatar Mar 06 '19 14:03 kjbracey

Pulled in the more generic option name and description from #453, not flipped the default yet. Going to discuss with Sam.

Guess if it ends up off I just need to spend the time to add it to the Eclipse plugin GUI, so it's trivial to activate...

kjbracey avatar Mar 07 '19 10:03 kjbracey

There are a lot of things I'd like to add to the Eclipse plugin, to make it easier to set any of the declared user options, among other improvements. 😄

flit avatar Mar 07 '19 16:03 flit

I know that this is a bit outdated, but still very usefull change to PyOCD. Hugely improves GDB experience, so I recommend that you try to merge this in.

SeppoTakalo avatar Oct 09 '19 08:10 SeppoTakalo

Yep, this should be rebased and merged in.

JanneKiiskila avatar Oct 11 '19 12:10 JanneKiiskila