artiq
Startup kernels should not be interruptible
One-Line Summary
The startup kernel should not be interruptible by other kernels.
Issue Details
Steps to Reproduce
- Flash core device with a startup kernel that takes an appreciable amount of time (e.g. waiting for all RTIO destinations to be up and initialising DDSes, …).
- Restart the core device.
- While the core device is booting (e.g. waiting for recovered clock), submit an experiment that launches a kernel.
Expected behavior
The experiment submitted on the master runs after the startup kernel.
Actual (undesired) behavior
The startup kernel gets interrupted by the experiment, leaving the hardware in a partially-initialized state.
@cjbe: I think most of the weird startup behaviour we have been seeing should be explained between this and the Urukul sync phase tuning issue on Alice.
I fundamentally agree that there is a need for something like "critical sections". Two things:
- Hung startup kernels: Is there a use case that needs to evict startup kernels that are hung for some reason (waiting for input, DRTIO link, programming error) by other means than restarting the core device?
- Critical sections in other kernels, e.g. idle kernel. Composite SPI transfers that are aborted by eviction would be a use case.
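To make the "critical sections" idea concrete, here is a minimal sketch in plain Python of deferring eviction while a kernel is inside a critical section. `KernelSlot`, `load` and `critical_section` are hypothetical names invented for illustration; none of this is ARTIQ API or how the runtime is actually structured.

```python
from contextlib import contextmanager


class KernelSlot:
    """Hypothetical model of the single kernel slot on a core device."""

    def __init__(self):
        self.running = None      # name of the kernel currently running
        self.pending = None      # kernel waiting for a critical section to end
        self.critical_depth = 0  # nesting depth of critical sections
        self.log = []

    def load(self, name):
        """Request that kernel `name` replace whatever is running."""
        if self.critical_depth > 0:
            # Eviction is not allowed right now; remember the request.
            self.pending = name
            self.log.append(f"deferred {name}")
        else:
            self._switch(name)

    def _switch(self, name):
        if self.running is not None:
            self.log.append(f"evicted {self.running}")
        self.running = name
        self.log.append(f"started {name}")

    @contextmanager
    def critical_section(self):
        """While active, load() requests are deferred instead of evicting."""
        self.critical_depth += 1
        try:
            yield
        finally:
            self.critical_depth -= 1
            if self.critical_depth == 0 and self.pending is not None:
                name, self.pending = self.pending, None
                self._switch(name)


slot = KernelSlot()
slot.load("startup")
with slot.critical_section():
    # e.g. a composite SPI transfer that must not be cut in half
    slot.load("experiment")
    assert slot.running == "startup"
assert slot.running == "experiment"
```

Note that in this model a kernel that hangs inside a critical section blocks every later `load()` indefinitely, which is exactly the concern raised in the first bullet.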
That was implemented in earlier ARTIQ versions by not opening the server socket in the runtime until the startup kernel has run, so the computer would get a "connection refused" error instead of interrupting the startup kernel. I guess that behavior got inadvertently changed later on.
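The ordering described here can be sketched with ordinary TCP sockets in Python. This is only an illustration of "run the startup kernel first, open the server socket afterwards"; the real runtime is not written this way, and `boot`, `try_connect` and `find_free_port` are hypothetical helpers.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused local TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def try_connect(port):
    """Return True if something accepts connections on the port."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=1.0):
            return True
    except OSError:
        # Typically ConnectionRefusedError: nothing is listening yet.
        return False


def boot(port, startup_kernel):
    """Run the startup kernel to completion, then start listening."""
    startup_kernel()  # no socket is open yet, so nobody can interrupt this
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", port))
    server.listen()
    return server


port = find_free_port()
attempts = []
# The "startup kernel" here just records whether a client could get in.
server = boot(port, lambda: attempts.append(try_connect(port)))
reachable_after_boot = try_connect(port)
server.close()
```

A client that connects during the startup phase is refused and has to retry, rather than being queued or evicting the startup kernel.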
> by other means than restarting the core device?
In my experience, having a USB (JTAG/serial) connection to each core device is indispensable anyway, so I'm not too worried about this.
By definition, hung startup kernels only occur when (re)starting the device, so just power-cycling the device again should never be a problem for users.
This is fixed in the new firmware on Zynq.