rcu_sched detected stalls on VirtualBox
Describe the issue you are experiencing
I ran into a problem after I installed a second Docker application (a miner) on my PC. After installing it, my HASS didn't run any more. I tried to restore my VM image to an earlier point, but that did not help, so I updated VirtualBox to the latest version and HASS was working again. I then restored HASS from my latest backup and updated it to 12.6/12.1/7.0 (the versions before the latest). After everything started working again, I noticed that after a day or two I can't log into HASS through the web interface. In the console I found that it is frozen at "ha >" and I have to issue a shutdown command, which then gives me "rcu: INFO: rcu_sched detected stalls on CPU/tasks: " and a lot of other info, and then I have to issue another shutdown command to actually shut down the Docker.
The strange thing is that after I start it again and check the log, I can see that all the sensors and switches were working while I couldn't access the frontend.
I'm running HASS on VirtualBox 6.1.30 on Windows 10: core-2021.12.7, supervisor-2021.12.2, Home Assistant OS 7.1.
So now I can't tell whether it is a HASS or a VirtualBox problem. Can you help, please?
What operating system image do you use?
ova (for Virtual Machines)
What version of Home Assistant Operating System is installed?
7.1
Did you upgrade the Operating System.
Yes
Steps to reproduce the issue
- start the HASS
- Works for 1-2 days
- Freeze ...
Anything in the Supervisor logs that might be useful for us?
can't see older logs.
After I restart HASS, everything in the log is green except for 2 warnings from Samba and Mosquitto about unsupported commands.
Anything in the Host logs that might be useful for us?
can't see older logs.
System Health information
No response
Additional information
No response
"rcu: INFO: rcu_sched detected stalls on CPU/tasks: " and a lot of other info,
Can you post a screenshot of this info? Can you also post the system, storage and network settings?
It seems that the Linux kernel hangs on something (either a disk read or the network). In those situations it's common that some parts of the system still work while others don't.
I have the exact same thing
Today it also happened at start, during 'A start job is running for Docker Application Container Engine'. Now trying to get it working again. I don't know what happened, I tried to increase CPU cores, but that didn't help.
I fixed my problem by reinstalling an earlier version of VirtualBox, restoring a snapshot from an earlier backup from 12.2021, and then updating to the newest versions of VirtualBox and HA. Just reinstalling the same version of VirtualBox didn't help. It is obvious that this is VM problem; in my case it happened just after installing another VM/Docker program that interfered with VirtualBox files or settings. I suggest doing what I did. Just make sure you back up your last few snapshots by copying them to a local drive. You can find more info on my reply in HA community post.
It is obvious that this is VM problem
What do you mean exactly by that, a problem of VirtualBox or the virtual machine image (HAOS ova)?
You can find more info on my reply in HA community post.
Which community post? Can you add a link to that post?
It happened again. I disconnected the audio device in VirtualBox and changed the chipset in the System tab to ICH9. After that I had to select the boot device in the VM BIOS. 4-5 days have passed and I have had no problems.
Same issue here with latest versions of Home Assistant and VirtualBox :/
https://imgur.com/a/MEjxM01
Rebooting the VM sometimes fixes this, sometimes not.
I have not seen it in the last few weeks, and I'm not sure what I did. I have updated everything to the latest versions. I did try some things with USB versions and disabled all items that have no relevance (e.g. like above, I think I removed the audio device). I also tested with the number of CPUs, but I don't know exactly why it runs stable now.
Solved for me by editing the VM and removing unused devices: floppy, optical, sound, USB
I have been getting this issue lately as well, plus some others. I was thinking my hard drive is going bad. I can sometimes get my HASSIO to boot and keep it up for most of the day, but then this happens at random, along with many other issues; I have a long list of screenshots of the errors. Right now, for example, I have tried resetting my VM about 7 times and still haven't managed to boot it.
I might try running fsck on my different partitions, but I'm afraid it will freeze up in the middle of the run. I'll look at removing unused drives and audio devices. I am also thinking about trying an older version of VirtualBox, or installing VirtualBox on a different SSD. I moved my VM from an old HDD to a different SSD and have been getting the same issues, so I'm confused about that. Here are some other errors I received:
- blk_update_request: I/O error, dev sda, sector 3423235...
- Buffer I/O error on device sda8, logical block 1084234
- systemd-coredump: failed to get COMM no such process
- CIFS: VFS: No username specified
- rcu: INFO: rcu_sched self-detected stall on CPU
- systemd-resolved.service: watchdog timeout
- rcu_sched kthread starved for 348343 jiffies!
- Failed to start Network Time Sync
- SQUASHFS error: Unable to read page, block 343423 size 7c6
- EXT4-fs error (device sda8): ext4_journal_check_start:83 Detected aborted journal
There were more besides these. Hopefully someone with more knowledge can point me in a direction to go. Thank you
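For sorting through a long list of console messages like the ones above, a small triage script can help separate storage errors from scheduler stalls. This is only a sketch: the category names and the regular expressions are assumptions derived from the messages quoted in this comment, not an official classification.

```python
import re
from collections import Counter

# Hypothetical categories; patterns are based only on the messages quoted above.
PATTERNS = {
    "storage": re.compile(r"I/O error|EXT4-fs error|SQUASHFS error"),
    "rcu_stall": re.compile(r"rcu_sched|rcu: INFO"),
    "watchdog": re.compile(r"watchdog timeout"),
}

# Messages as reported in the comment (the elided sector number is kept as-is).
SAMPLE = [
    "blk_update_request: I/O error, dev sda, sector 3423235...",
    "Buffer I/O error on device sda8, logical block 1084234",
    "systemd-coredump: failed to get COMM no such process",
    "CIFS: VFS: No username specified",
    "rcu: INFO: rcu_sched self-detected stall on CPU",
    "systemd-resolved.service: watchdog timeout",
    "rcu_sched kthread starved for 348343 jiffies!",
    "Failed to start Network Time Sync",
    "SQUASHFS error: Unable to read page, block 343423 size 7c6",
    "EXT4-fs error (device sda8): ext4_journal_check_start:83 Detected aborted journal",
]


def classify(line: str) -> str:
    """Return the first matching category name, or 'other'."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return "other"


def summarize(lines) -> Counter:
    """Count how many non-empty lines fall into each category."""
    return Counter(classify(line) for line in lines if line.strip())


if __name__ == "__main__":
    for category, count in summarize(SAMPLE).most_common():
        print(f"{category}: {count}")
```

If most lines land in the storage group, checking the host disk and the VDI file first may be more productive than tuning CPU settings.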
There hasn't been any activity on this issue recently. To keep our backlog manageable we have to clean old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant OS version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
I have been running for weeks without issue. I also reduced the number of CPUs to just one. I did notice some speed issues, so I tried increasing to two CPUs, but this creates a more unstable environment. I assigned 2.5 GB of memory and 2 CPUs to VirtualBox, but it has a harder time starting than with one CPU and less memory.
In Windows I see high memory use (but I'm not sure if this is just VirtualBox's reservation), and I see continuous CPU usage. The CLI shows that it is stuck at starting hypervisor. After a reboot it works, but I do get more 'rcu_sched' notifications and got an unresponsive HA, which I haven't had for weeks.
Reverting back to 1 CPU solves the issues.
Same thing. I have currently been running one CPU for a week with no stalls; if I assign 2 CPUs I get a stall within 5-10 minutes of starting. This happened after the update to 5.5; before that I got stalls with 2 CPUs about once a week. If you get a stall, most of the time you don't need to restart: just PAUSE and then UNPAUSE, or press CTRL+P twice, and it will continue, but it might stall again if the process is not finished.
I have read that in Linux the 2 CPUs can have different timings (not synced to each other), and that this can cause big problems and stalls.
Interesting findings! I wonder if that is a known issue in upstream VirtualBox?
Since the problem got much worse after upgrading HA, I'm sure it is an HA problem and not VB.
Yesterday I tried again with the latest and greatest versions of Virtualbox, HA, OS and all. Set it to two CPU's and I think within 30 minutes HA stalled, I got the rcu_sched notifications etc.
I'm not sure if the host hardware could be impacting this, or what is the cause, but it just doesn't work. Or maybe it is impossible to switch between cpu's and I should start fresh with 2 cpu's, but I haven't tested that yet.
Maybe it is a CPU-related problem. Mine is an AMD V1605B; what CPU do you have?
Intel Core i5-4690.
Well, with one CPU there are no problems, but when I opened Task Manager I noticed that only CPUs 1 and 2 are used; the others are parked. I hate loading only one CPU, so I unparked the rest of the CPUs, and now all of them are utilized by a small amount. We will see how it goes, but so far the results are very good. I used the program "unpark cpu".
I'm getting this same error: rcu_sched self-detected stall on cpu
VirtualBox: 6.1.34 r 150636 (Qt5.6.2) Home Assistant OS: 8.1 Home Assistant Core: 2022.6.1 CPU: Intel Core i7-5930k
I'm using latest version of virtualbox and HA OS 8.1. I had 2 CPUs dedicated to the VM. This would allow HA to run for up to a couple days before freezing. I upped it to 4 CPUs and that made the problem happen much quicker, at least from what I experienced.
After switching the VM to 1 CPU, I'm seeing no issues. This seems to be reproducible and consistent when using anything other than 1 CPU.
Maybe this tip works? https://www.virtualbox.org/ticket/20131#comment:2
I found that "perf top" was good at stalling it out a bit, and doing a "vboxmanage modifyvm foo --hpet on" on the host made the problem occur virtually never or not at all for that VM, even while every other VM without that change was stalling.
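The host-side changes reported so far in this thread (CPU count, chipset, audio, and the HPET tip quoted above) can all be applied with `VBoxManage modifyvm`. The sketch below only builds the command lines and prints them; it does not run anything. The helper name `build_modifyvm_commands` and the VM name `haos` are hypothetical, and the `--audio none` spelling is assumed from VirtualBox 6.1 (newer releases may use different audio options). `modifyvm` generally expects the VM to be powered off.

```python
import shlex


def build_modifyvm_commands(vm_name, cpus=None, hpet=None, chipset=None,
                            disable_audio=False):
    """Return a list of VBoxManage argument lists; nothing is executed."""
    base = ["VBoxManage", "modifyvm", vm_name]
    commands = []
    if cpus is not None:
        if cpus < 1:
            raise ValueError("cpus must be at least 1")
        commands.append(base + ["--cpus", str(cpus)])
    if hpet is not None:
        commands.append(base + ["--hpet", "on" if hpet else "off"])
    if chipset is not None:
        if chipset not in ("piix3", "ich9"):
            raise ValueError("chipset must be 'piix3' or 'ich9'")
        commands.append(base + ["--chipset", chipset])
    if disable_audio:
        # Assumption: VirtualBox 6.1 syntax; check the manual for your version.
        commands.append(base + ["--audio", "none"])
    return commands


if __name__ == "__main__":
    # "haos" is a placeholder VM name.
    for argv in build_modifyvm_commands("haos", cpus=1, hpet=True,
                                        chipset="ich9", disable_audio=True):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to apply one change at a time and see which one, if any, affects the stalls.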
did that. it doesn't help.
Tried as well, also didn't work.
Here is a video with explanations regarding CPU stalls. I think what we are dealing with is explained at 16:11.
https://www.youtube.com/watch?v=23_GOr8Sz-E
It essentially means the kernel doesn't get to run on a particular CPU within a certain time limit. That can have different causes:
- The kernel itself locks up (on a particular CPU) due to some code blocking (this can be bad/buggy drivers, actual kernel bug, or broken hardware). Unlikely to be the issue here.
- The kernel actually doesn't get CPU time within a certain timespan, caused either by a timer not firing or by differing expectations of timer speed. Both causes are likely on virtual machines.
IMHO, this is a VirtualBox bug. Maybe VirtualBox needs a particular kernel config to run fine, but if that is the case, it should be documented somewhere. Also, HAOS kernel configuration is not really special, it mainly enables a lot of virtualization drivers. It works fine on other Virtual Machines as well, so :man_shrugging:
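As a rough illustration of the second cause, here is a toy model (not the kernel's actual algorithm) in which a vCPU is reported as stalled once it has not been given a chance to run for longer than a timeout. The function name, the tick-based clock, and the timeout value are all assumptions chosen for the example.

```python
def simulate(run_ticks, total_ticks, timeout_ticks):
    """Toy stall detector.

    run_ticks maps a vCPU name to the set of ticks on which the host
    actually scheduled it. Returns {name: tick of first stall report}.
    """
    last_seen = {name: 0 for name in run_ticks}
    first_stall = {}
    for tick in range(1, total_ticks + 1):
        # A vCPU that runs on this tick checks in with the detector.
        for name, ticks in run_ticks.items():
            if tick in ticks:
                last_seen[name] = tick
        # Any vCPU silent for longer than the timeout is reported once.
        for name, seen in last_seen.items():
            if tick - seen > timeout_ticks and name not in first_stall:
                first_stall[name] = tick
    return first_stall


if __name__ == "__main__":
    schedule = {
        "cpu0": set(range(1, 41)),                       # runs on every tick
        "cpu1": set(range(1, 5)) | set(range(31, 41)),   # descheduled for ticks 5-30
    }
    print(simulate(schedule, total_ticks=40, timeout_ticks=21))
```

In this model a single vCPU can never be starved relative to another, which is at least consistent with the reports that one-CPU guests run without stalls; it does not prove where the fault lies.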
I saw in another post https://github.com/home-assistant/operating-system/issues/1737#issuecomment-1108843382 that you can change the paravirtualization settings and that some worked. I now am running on 'minimal' and so far it is running for a few hours, more than the test yesterday. I however also increased memory so it could still be that has some impact. I'll keep you updated.
So far up and running without issue since my previous post! Looks promising.
I noticed this week that I have been experiencing the same issue with a Home Assistant install I have running in VirtualBox on Windows 10. I may have been experiencing the issue for several weeks/months before this week, but a lot of variables on my end changed about a month ago (installed new networking hardware on computer, caught up on Windows 10 upgrades, reinstalled the latest version of Home Assistant VDI image), so I can't be certain before that point.
I'll try changing System > Acceleration > Paravirtualization interface to KVM per https://github.com/home-assistant/operating-system/issues/1737#issuecomment-1106855304 and report back.
Specs
PC:
- Edition: Windows 10 Pro
- Version: 21H2
- OS build: 19044.1706
- Experience: Windows Feature Experience Pack 120.2212.4170.0
- Processor: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz 3.60 GHz
- Installed RAM: 32.0 GB
- System type: 64-bit operating system, x64-based processor
- Pen and touch: No pen or touch input is available for this display
VirtualBox 6.1.34 r150636 (Qt5.6.2)
Home Assistant VM
- Image: haos_ova-7.6.vdi
System
- Base Memory: 10240 MB
- EFI: Enabled
- Acceleration: VT-x/AMD-V, Nested Paging, KVM Paravirtualization
Display
- Video Memory: 16 MB
- Graphics Controller: VMSVGA
- Remote Desktop Server: Disabled
Audio
- Host Driver: Windows DirectSound
- Controller: Intel HD Audio
Storage
- Controller: IDE
- IDE Secondary Device 0 [Optical Drive]: Empty
- Controller: SATA
- SATA Port 0: haos_ova-7.6.vdi (Normal, 32 GB)
Network
- Adapter 1: Intel PRO/1000 MT Desktop (Bridged Adapter, Intel(R) Wi-Fi 6E AX210 160Mhz)
Serial ports: Disabled
USB:
- USB Controller: OHCI
- Device Filters: 1 (1 active)
@curtgrimes the comment below mentions KVM didn't work as I also believe that is equal to default. However, try it!
Mine is now stable with 2 cpus with the minimal setting!
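To check which of the settings discussed above a VM currently has, the output of `VBoxManage showvminfo <vm> --machinereadable` can be parsed as key=value pairs. A minimal sketch, assuming the keys `cpus` and `paravirtprovider` appear in that output as they do in the sample string (the sample itself is made up for illustration, and the notes only restate anecdotal reports from this thread):

```python
def parse_machinereadable(text):
    """Parse key=value lines, stripping surrounding double quotes."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip().strip('"')] = value.strip().strip('"')
    return settings


def thread_notes(settings):
    """Return reminders based on what posters in this thread reported."""
    notes = []
    if int(settings.get("cpus", "1")) > 1:
        notes.append("more than one vCPU: several posters only saw stalls with 2+ CPUs")
    if settings.get("paravirtprovider", "default") != "minimal":
        notes.append("paravirtualization is not 'minimal': one poster reported "
                     "stability after switching to it")
    return notes


# Made-up sample; real output contains many more keys.
SAMPLE = 'name="haos"\ncpus=2\nparavirtprovider="kvm"\nhpet="off"\n'

if __name__ == "__main__":
    for note in thread_notes(parse_machinereadable(SAMPLE)):
        print("-", note)
```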
One CPU, but with all cores unparked.
VB is sharing all CPUs with Windows.
At the time of the snapshot I'm updating HA and nothing else is using the CPU.
I have been using this method for more than a week with no issues so far.
I tested changing the paravirtualization setting to minimal, and although it lasted longer before freezing, it froze anyway. One curious thing I discovered is that if you enable the virtual keyboard from VirtualBox (Input > Keyboard) and type something, the machine unfreezes, but that's not good either way.
I also noticed that previously I saw the rcu errors on the VM screen, but now it freezes without telling me anything. I'm quite sure the issue is still the rcu_sched warnings, though.
I reduced the number of CPUs to 1 to see if that helps.