ignite
ignite copied to clipboard
Host deadlock under load
I have an ignite host that's routinely hanging under load. Host becomes completely unresponsive, doesn't ping, SSH connections freeze and time out, no messages on ipmi console. Ubuntu 20.04. AMD EPYC CPU, 24 cores (I think) with 128GB RAM. Bare metal obviously.
Workload is rapidly starting and stopping many ignite VMs. I've added a mutex so that ignite run
never runs concurrently with itself or ignite rm
. Has anyone seen anything like this?
How do you advise to diagnose the issue? I'd enable kdump but the kernel never panics, just hangs...
Just noting from our previous convo in slack a few weeks ago:
i also made the system which is controlling ignite less aggressive and haven't seen any crashes yet - if i get no crashes on this server with the code changes, i'll see if it still crashes on the old server
Reducing the concurrent vm startup or runtime seems to have resolved some of this deadlock problem