flor
flor copied to clipboard
Fault Tolerance
- Resume training on crashes.
- Write valid logs when possible, even during failures (exception handling).
- Reconstruct logs on sudden failures.
The goal is to enable resuming a crashed execution, and to enable hindsight logging over a crashed execution and partial/invalid MEMO.json file.