uhyve
uhyve copied to clipboard
Do all threads have to terminate before the main thread terminates?
We should continue the discussion #68 here.
I would vote for "it is the kernels responsibility to make sure that all threads terminate correctly and exit the VM". I haven't put much thought to it, but it seems much more straight forward. For example, the kernel may still have some work to do e.g. flushing buffer and may split that task among all cores. The current design forces the kernel to synchronize the cores and make sure that no exit code is provided until everything is done. Furthermore the hypervisor needs additional code to handle that case. But than again, I haven't thought to much on how hard it is to ensure that all cores exit the VM. 🤷♂️
I also think that the kernel should properly offline all cores when shutting down. If the kernel doesn't shut down all cores, then that should be treated as an error (as long as we don't support dynamic hotplugging). My proposal would be for the following behavior:
- If one thread exits with
Err, then I think we can assume that the kernel crashed and can exit with an error, without waiting for any other threads. - If one thread exits normally, we should set up an alarm (or some other timer) to give the kernel some time to properly shut down all vCPUs.
- If all other threads terminate before the alarm goes off, then the kernel exited successfully. The exit code needs some special attention here since each vCPU returns an option
. I'd prefer if we explicitly check that a) only one vCPU returns an exit code, and b) it is the last vCPU that returns the exit code. - If at least one thread fails to terminate before the alarm handler gets invoked, then exit uhyve with an error code and a warning.
- If all other threads terminate before the alarm goes off, then the kernel exited successfully. The exit code needs some special attention here since each vCPU returns an option
* If one thread exits normally, we should set up an alarm (or some other timer) to give the kernel some time to properly shut down all vCPUs. * If all other threads terminate before the alarm goes off, then the kernel exited successfully. The exit code needs some special attention here since each vCPU returns an option. I'd prefer if we explicitly check that a) only one vCPU returns an exit code, and b) it is the last vCPU that returns the exit code. * If at least one thread fails to terminate before the alarm handler gets invoked, then exit uhyve with an error code and a warning.
I agree that this is probably the best solution, but I am wondering if that is worth the effort given that we need to maintain a testcase for it and find a reasonable value on how long a kernel is allowed keep some cores online. I would only implement such bug mitigation for a daemon. For a simple command line tool it seems fine to deadlock as I can still kill it easily.
For a simple command line tool it seems fine to deadlock as I can still kill it easily.
My main motivation is to improve the usability of uhyve for CI. If uhyve simply deadlocks, then the CI jobs will still be marked as "running" although they actually have already failed, which delays the feedback.
find a reasonable value on how long a kernel is allowed keep some cores online
For this purpose a timeout in the order of seconds would be perfectly fine, and I would be surprised if that wouldn't be enough for the kernel to shutdown the other threads regularly (but I may be wrong about that).
For a simple command line tool it seems fine to deadlock as I can still kill it easily.
My main motivation is to improve the usability of uhyve for CI. If uhyve simply deadlocks, then the CI jobs will still be marked as "running" although they actually have already failed, which delays the feedback.
I see. In that case, would a timeout for the individual tests not make more sense? After all, there are other bugs that could cause a deadlock.
Timeouts for the tests have a larger overhead/safety margin since the time for the test itself also has to be considered.
After all, there are other bugs that could cause a deadlock. If there is a deadlock in the kernel itself, then this, of course has to be caught with the existing test timeout.
Thinking about it again, I think the main issue I had, is that previously the kernel would not exit if one thread paniced in kernel mode (I remember we addressed that and now exit after Err).
So I guess you are right and adding additional timeout handling for non Err return values is probably just adding unnecessary complexity.
Timeouts for the tests have a larger overhead/safety margin since the time for the test itself also has to be considered.
After all, there are other bugs that could cause a deadlock. If there is a deadlock in the kernel itself, then this, of course has to be caught with the existing test timeout.
Thinking about it again, I think the main issue I had, is that previously the kernel would not exit if one thread
paniced in kernel mode (I remember we addressed that and now exit afterErr). So I guess you are right and adding additional timeout handling for nonErrreturn values is probably just adding unnecessary complexity.
I just noticed that I have not explicitly stated my motivation why I am skeptical of adding the timer: The hypervisors is security sensitive, which is why I am just worried that we are adding unnecessary code. If we can handle it in the CI than I think we should do it in the CI. That doesn't mean we should not add it to uhyve if it turns out to be necessary.