RIOT core/debug: printf what asked to print

Contribution description

removes the unhelpful stack-size-test from debug-print

Testing procedure

Issues/PRs references

this is much more sane than #20166

Dec 12 '23 08:12 kfessel

Murdock results

:heavy_check_mark: PASSED

b39fbe23e389e581b9c5e76fd082e0c33ce1b317 core/debug: printf what asked to print

Success	Failures	Total	Runtime
8082	0	8082	10m:54s

Artifacts

Documentation preview

Dec 12 '23 09:12 riot-ci

Can I haz an explanation why this check is no longer needed?

Dec 12 '23 20:12 OlegHahm

With a switch to picolibc we wouldn't get stack overflows on printf() that easily.

With newlib, this can still happen. We do have the MPU based stack overflow and the heuristic in place in addition, so that stack overflows should at least get caught. And an "the debug info you wanted to see cannot be shown until you increase stack size" and an "stack overflown" both have the same result: Bump the stack size and flash again. So there isn't much lost even when using newlib.

Dec 12 '23 20:12 maribu

With a switch to picolibc we wouldn't get stack overflows on printf() that easily.

Not that easily but still possible, right? And as far as I understood newlib is still the default or am I wrong?

With newlib, this can still happen. We do have the MPU based stack overflow and the heuristic in place in addition, so that stack overflows should at least get caught.

On how many platforms do we have MPU support? And what is the "heuristic" one?

Dec 12 '23 20:12 OlegHahm

Most Cortex M boards have an MPU, but for some we don't use it to due bugs, if I recall correctly.

Some RISC-V MCUs also have an MPU (they call it differently, though). I'm not sure if we use that to guard the stack, though. @teufelchen should know.

The heuristic is the THREAD_CREATE_STACKTEST that fills the stack with canary values. (The address of a memory location is used as canary for that location.) The heuristic can fail if despite the stack overflow the canary is still there (hitting the value by chance is 1 to 2^32-1 on 32 bit systems). The heuristic in practice works pretty well, but it is enavled by default only with develhelp, if I recall correctly.

Dec 12 '23 21:12 maribu

As far as I recall these MPUs (or at least some of them) only allow to protect a limited number of memory segments, i.e., cannot be used for arbitrary stack protection.

Regarding STACKTEST: that doesn't provide any stack protection but only shows stack usage. This feature already was in place when this macro was introduced but didn't help that much.

I totally agree that this macro hack is ugly as fuck but was introduced for a reason and I'm not convinced that reason has vanished.

Dec 13 '23 07:12 OlegHahm

@OlegHahm:

https://github.com/RIOT-OS/RIOT/blob/master/core/sched.c#L126-L139

Dec 13 '23 07:12 maribu

The heuristic can fail if despite the stack overflow the canary is still there (hitting the value by chance is 1 to 2^32-1 on 32 bit systems). The heuristic in practice works pretty well, but it is enavled by default only with develhelp, if I recall correctly.

Actually the chance of overflowing the stack without hitting the test is much higher, as sth might just allocate a huge buffer on stack, just modifying the SP, but not writing to it. the test only covers changing the exact canary value, not anything below.

Dec 13 '23 08:12 kaspar030

As far as I recall these MPUs (or at least some of them) only allow to protect a limited number of memory segments, i.e., cannot be used for arbitrary stack protection.

What we do here is to also switch segments when task switching. I'm not sure though whether we have a segment for the ISR stack.

Dec 13 '23 08:12 kaspar030

Actually the chance of overflowing the stack without hitting the test is much higher, as sth might just allocate a huge buffer on stack, just modifying the SP, but not writing to it.

That is pretty like one of the vectors an attacker would try; it would also work when "jumping over" the MPU guard space.

The Linux kernel uses a similar mechanism to detect stack overflows, I think it has at least two unmapped pages after the stack ("after" here in terms of the direction the stack grows) in the virtual address, causing segfaults on most stack overflows. But jumping of the guard space is possible there as well. I think GCC can be told to touch at least one byte in every page it allocates on the stack to prevent this attack vector, but it requires binaries to be compiled with that magic. We might want to enable that as well (but with MPU guard area size instead of page size)?

Dec 13 '23 08:12 maribu

I also experimented with enlarging the canary value to a proper redzone (bringing non-mpu checks closer to what the mpu does), but never got to PR that. this is the branch.

Dec 13 '23 09:12 kaspar030

I like the idea, but I don't like zero being the magic number there. I think e.g. GCC when asked to write at least one byte to every page of stack allocation during the allocation will write zero to it. The stacktest approach would also make it more difficult to correctly guess the correct magic value, as it depends on memory layout. (But then again, the memory layout is reproducible and static within the same firmware.)

Maybe we could also just check one word on context switch, and have the idle thread check the red zones of all stacks before going to sleep? Or is the context switch overhead not as bad as I assume?

Dec 13 '23 09:12 maribu

I like the idea, but I don't like zero being the magic number there.

I guess re-using the stack test approach makes sense.

Or is the context switch overhead not as bad as I assume?

TBH, I don't remember. I worked on that at the side like, three years ago. I'd assume the overhead to be prohibitive for production, though it might not be that bad. (assuming a word wise memory compare takes a cycle, in theory we're looking at only like, +10-20 cycles per context switch, which would be <5-10% overhead.)

Dec 13 '23 09:12 kaspar030

Regarding this PR: the reason for this check was that enabling the debug macro may easily result in a stack overflow because of newlib's high stack usage for printf. In that case I would assume that the chances are actually not that bad that the scheduler stack test in place would catch these cases. Hence, the check may be obsolete indeed.

So, how about removing this check and see if we notice that people start to run into this type of problem on a regular basis, we can still revert the change.

Dec 13 '23 10:12 OlegHahm

As stated above: from my perspective, let's give it a try.

Oct 29 '24 20:10 OlegHahm