firedancer
Support cooperative scheduling
Problem
Our tiles largely assume a static tile layout with correctly configured core affinity and process isolation. This is certainly the right architecture for a happy path production configuration.
While fd_tile supports a floating tile architecture, almost all threads make extensive use of spinlocks that don't address priority inversion (because there is no need to with fixed tile affinity). Therefore floating tile operation is currently impractical.
There are still several scenarios where a floating tile architecture could make sense. To name a few:
- The startup sequence, during which we might do self-tests, or spawn ephemeral pipelines like snapshot loading and verification
- CLI tools running on resource constrained hosts (e.g. instrumented ad-hoc replay)
- Debugging
- Power saving jobs where CPU should downclock on low demand (fixed tile spinloop would consume 100% CPU time)
- Operators that don't have root (e.g. institutions with strict security policies)
- Machines with a lower thread count than the minimum tile count a pipeline needs to run
I'm not saying we should decide to support any of these for sure. But I suggest we adapt these tiles to optionally run in a floating architecture just in case.
Suggested Changes
Instead of spinning indefinitely, optionally allow a tile to yield to the scheduler (kernel) and make its scheduling dependencies explicit, possibly using futex(2).
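A minimal sketch of what such an optional yield could look like, assuming Linux and a 32-bit sequence word in shared memory that a producer tile bumps when new work is available. The names `seq_wait`, `seq_publish`, `futex_call`, `SPIN_MAX` and the `allow_yield` flag are hypothetical and not part of fd_tile; only the futex(2) syscall and its `FUTEX_WAIT` / `FUTEX_WAKE` operations are real. The non-private futex operations are used on the assumption that tiles may live in different processes.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical spin budget before a floating tile blocks in the kernel. */
#define SPIN_MAX 1000

/* Thin wrapper over futex(2); glibc provides no dedicated wrapper. */
static long
futex_call( _Atomic uint32_t * addr, int op, uint32_t val ) {
  return syscall( SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0 );
}

/* Wait until *seq differs from `seen` and return the new value.
   allow_yield==0: pure spin (fixed tile behaviour).
   allow_yield!=0: spin briefly, then sleep in the kernel (floating tile). */
static uint32_t
seq_wait( _Atomic uint32_t * seq, uint32_t seen, int allow_yield ) {
  for(;;) {
    for( int i=0; i<SPIN_MAX; i++ ) {
      uint32_t cur = atomic_load_explicit( seq, memory_order_acquire );
      if( cur!=seen ) return cur;
    }
    /* FUTEX_WAIT returns immediately if *seq != seen, so a publish that
       lands between the spin loop and this call is not lost. */
    if( allow_yield ) futex_call( seq, FUTEX_WAIT, seen );
  }
}

/* Publish a new sequence value and wake any tiles sleeping on it. */
static void
seq_publish( _Atomic uint32_t * seq, uint32_t next ) {
  atomic_store_explicit( seq, next, memory_order_release );
  futex_call( seq, FUTEX_WAKE, INT_MAX );
}
```

Waiting on a specific word makes the dependency explicit: the kernel knows which tile is blocked on which producer, so it can run the producer instead of letting the waiter burn its time slice. A real implementation would likely skip the `FUTEX_WAKE` syscall when no tile is sleeping (for example by tracking a waiter count), since an unconditional syscall on every publish would hurt the fixed tile fast path.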