
Potential deadlocks if `kernel.processes_blocked` is out of sync

Open jettr opened this issue 3 years ago • 4 comments

Many of the schedulers use the `kernel.processes_blocked()` function to determine whether the scheduler should even look at the processes to schedule them. If `processes_blocked()` returns false, the schedulers assume that at least one process has work to perform. If that assumption is wrong, the kernel may panic or deadlock.

The kernel provides work increment and decrement methods that processes use when they have pending work. Processes also have a `ready()` method that lets the kernel know whether the process is ready to be scheduled. Every process must carefully increment and decrement the shared work counter for `kernel.processes_blocked()` to be correct. As of now, the current code does this correctly.
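To illustrate why this bookkeeping is fragile, here is a minimal sketch of the shared-counter pattern described above. The names (`Kernel`, `increment_work`, etc.) are illustrative, not Tock's exact API; the point is that a single missed or doubled call anywhere in the kernel desynchronizes the counter from the processes' actual readiness.

```rust
use std::cell::Cell;

/// Hypothetical model of the kernel's shared work counter.
struct Kernel {
    work: Cell<usize>,
}

impl Kernel {
    fn new() -> Kernel {
        Kernel { work: Cell::new(0) }
    }

    // A process calls this when it gains pending work...
    fn increment_work(&self) {
        self.work.set(self.work.get() + 1);
    }

    // ...and this when the work completes. A missed decrement (or a
    // double decrement) silently makes processes_blocked() wrong.
    fn decrement_work(&self) {
        self.work.set(self.work.get() - 1);
    }

    // Schedulers consult this before even looking at the processes.
    fn processes_blocked(&self) -> bool {
        self.work.get() == 0
    }
}

fn main() {
    let kernel = Kernel::new();
    assert!(kernel.processes_blocked());
    kernel.increment_work();
    assert!(!kernel.processes_blocked());
    kernel.decrement_work();
    assert!(kernel.processes_blocked());
}
```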

This optimization does not seem worth the risk of getting it wrong. We can simplify the assumptions and remove the increment/decrement work concept along with the `processes_blocked()` method. The only affected case is when there are no processes to run: the schedulers will loop through all processes once to see if there is any work. That happens when the chip is about to go to sleep anyway, so it isn't a path the core needs to optimize for.
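The simplified scheme amounts to asking each process directly instead of trusting a counter. A rough sketch of the idea (the `Process` type and `any_work` helper are invented for illustration; Tock's real `ready()` checks the process state and its task queue):

```rust
/// Stand-in for a Tock process; `tasks` models its queue of pending tasks.
struct Process {
    tasks: Vec<u32>,
}

impl Process {
    // Ground truth lives with the process itself, so there is no
    // separate counter that can drift out of sync.
    fn ready(&self) -> bool {
        !self.tasks.is_empty()
    }
}

/// One pass over all processes, done only just before sleeping.
fn any_work(processes: &[Process]) -> bool {
    processes.iter().any(|p| p.ready())
}

fn main() {
    let procs = vec![Process { tasks: vec![] }, Process { tasks: vec![7] }];
    assert!(any_work(&procs));

    let idle = vec![Process { tasks: vec![] }];
    assert!(!any_work(&idle)); // nothing to do: the chip can sleep
}
```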

This is related to #3137 but taking a step back.

Thoughts?

jettr avatar Aug 11 '22 20:08 jettr

> We can simplify assumptions and remove the increment/decrement work concepts along with the processes_blocked method. This will only affect the case when there are no processes to run, and the schedulers will loop through all processes once to see if there is any work. This is when the chip is about to go to sleep anyway, so this isn't really time the core needs to optimize for.

I looked through the scheduling code and agree pretty strongly with this conclusion. I think before, when we only had a priority scheduler, the kernel work counting represented a more significant optimization. But now it only saves time immediately before going to sleep, and that comes at the cost of having to increment and decrement the work counter in a whole bunch of other places throughout the kernel. I tried removing the kernel work counting and found that doing so saves 176 bytes and simplifies the scheduling logic nicely (https://github.com/hudson-ayers/tock/tree/no-work-count). This also has the side benefit of making any mismatch between the work counter and process.ready() counts impossible.

hudson-ayers avatar Aug 23 '22 20:08 hudson-ayers

I think I am likely in favor of this. Without looking at the actual assembly, the sleep path looks like it should be pretty lightweight:

for node in self.processes.iter() {
    Some(node.proc)?                        // 3~5 cycles (one or two ldr's @ 2 + tst @ 1)
      proc.ready()?                         // ~20 cycles, likely, if not inlined
        Some([proc=self].tasks)?            // 3~5 cycles
          [tasks=ring_buf].has_elements()?  // almost certainly inlined
            [ring_buf=self].head == tail?   // 5~7 cycles

To be conservative, let's say 50 cycles per proc entry. Most boards have NUM_PROCS of 4-8, so we're talking an order 200-400 cycle penalty on sleep, which is mitigated somewhat by the cycles won back from not doing the state tracking.

While I would like to keep low-power operation in Tock's milieu, if my envelope math checks out, it's hard to make the energy case for this optimization on modern cores, I think. For example: the Apollo3 draws 6 µA/MHz (i.e. 288 µA active at 48 MHz), and 400 cycles @ 48 MHz is 8.3 µs, which at 1.8 V VCC comes to about 4.3 nJ. Even 2000 cycles at 3.3 V would still be just 2000 × 6 × 3.3 ≈ 40,000 pJ = 40 nJ.
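The envelope arithmetic above can be reproduced directly. A throwaway sketch (the `energy_nj` helper and its parameter names are invented for illustration): 6 µA/MHz works out to 6 pJ per cycle per volt, independent of clock frequency, since the extra factor of frequency cancels.

```rust
/// Energy in nJ for `cycles` cycles on a core drawing `ua_per_mhz` µA/MHz
/// at supply voltage `vcc`. 6 µA/MHz at V volts is 6·V pJ per cycle:
/// (6e-6 A/MHz · V) / (1e6 cycles/s per MHz) = 6e-12 J/V per cycle.
fn energy_nj(cycles: f64, ua_per_mhz: f64, vcc: f64) -> f64 {
    cycles * ua_per_mhz * vcc * 1e-3 // pJ -> nJ
}

fn main() {
    // 400 cycles on the Apollo3 (6 µA/MHz) at 1.8 V:
    println!("{:.1} nJ", energy_nj(400.0, 6.0, 1.8)); // ~4.3 nJ
    // Conservative case: 2000 cycles at 3.3 V:
    println!("{:.1} nJ", energy_nj(2000.0, 6.0, 3.3)); // ~39.6 nJ
}
```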

The worst-case scenario is probably not energy but timing: an interrupt arrives that adds a task to the first process just after the scheduler has dismissed that process as having no tasks. When the scheduler resumes, it will iterate through the rest of the array, try to sleep, fail at `global_instance_calls_pending()?`, and then re-enter the scheduler. Here that 8.3 µs (let's round up to 10 µs given the extra bits) will limit worst-case interrupt handling to 100 kHz.
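That 100 kHz bound follows directly from the scan time: if every interrupt can force one full re-scan before the next is serviced, the scan period caps the rate. A trivial check (the helper name is invented):

```rust
/// Worst-case interrupt service rate (kHz) if each interrupt can cost
/// one full process-array scan taking `scan_us` microseconds.
fn worst_case_irq_rate_khz(scan_us: f64) -> f64 {
    1000.0 / scan_us
}

fn main() {
    // 8.3 µs rounded up to 10 µs, as in the comment above:
    assert_eq!(worst_case_irq_rate_khz(10.0), 100.0);
    println!("{} kHz", worst_case_irq_rate_khz(10.0));
}
```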

Optimizing that seems like something we should look at when such interrupt processing actually proves itself to be a problem, but I would like to hear @phil-levis's thoughts here.

ppannuto avatar Aug 25 '22 17:08 ppannuto

If you look at the implementation I went with in #3158, I removed `processes_blocked()` entirely and check whether to go to sleep as part of iterating through the processes to decide what to run next. So for your timing example, I think a well-implemented scheduler could still address this case pretty effectively -- round_robin isn't perfect because the just-running process will be at the back of the queue, but I think MLFQ or the priority scheduler would observe basically no additional overhead from this, since the just-yielded process would still be the next process checked for readiness.
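The shape of that approach -- folding the sleep decision into the same pass that picks the next process -- might be modeled roughly like this. This is an illustrative sketch, not the actual #3158 code; `Process`, `Decision`, and `next_decision` are invented names.

```rust
/// Stand-in for a Tock process with a queue of pending tasks.
struct Process {
    tasks: Vec<u32>,
}

impl Process {
    fn ready(&self) -> bool {
        !self.tasks.is_empty()
    }
}

enum Decision {
    Run(usize), // index of the process to run next
    Sleep,      // no process is ready: go to sleep
}

/// One pass decides both what to run and whether to sleep, so there is
/// no separate processes_blocked() check that could disagree with it.
fn next_decision(processes: &[Process]) -> Decision {
    for (i, p) in processes.iter().enumerate() {
        if p.ready() {
            return Decision::Run(i);
        }
    }
    Decision::Sleep
}

fn main() {
    let procs = vec![Process { tasks: vec![] }, Process { tasks: vec![9] }];
    assert!(matches!(next_decision(&procs), Decision::Run(1)));

    let idle = vec![Process { tasks: vec![] }];
    assert!(matches!(next_decision(&idle), Decision::Sleep));
}
```

Note that a priority-ordered iteration naturally re-checks the highest-priority (or just-yielded) process first, which is why the overhead concern mostly vanishes for the priority and MLFQ schedulers.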

hudson-ayers avatar Aug 25 '22 18:08 hudson-ayers

This will be fixed when #3158 lands. However, I'm waiting on AppId before merging #3158, because they touch a lot of similar code.

bradjc avatar Sep 12 '22 21:09 bradjc