v86
v86 copied to clipboard
Kernel hang with CONFIG_DEBUG_NMI_SELFTEST (Alpine Linux)
When booting Alpine Linux in v86 (default kernel compiled with CONFIG_DEBUG_NMI_SELFTEST=y), the kernel softlocks during nmi_selftest:
[ 1.350333] smpboot: weird, boot CPU (#0) not listed by the BIOS
[ 1.350333] smpboot: SMP disabled
[ 1.363333] Performance Events: unsupported Netburst CPU model 6 no PMU driver, software events only.
[ 1.365666] rcu: Hierarchical SRCU implementation.
[ 1.375666] NMI watchdog: Perf NMI watchdog permanently disabled
[ 1.378999] smp: Bringing up secondary CPUs ...
[ 1.380333] smp: Brought up 1 node, 1 CPU
[ 1.380333] smpboot: Max logical packages: 2
[ 1.380333] ----------------
[ 1.380333] | NMI testsuite:
[ 1.380333] --------------------
[ 1.380333] remote IPI: ok |
[ 1.380333] local IPI:
[ 29.373333] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]
[ 29.373666] Modules linked in:
[ 29.373666] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.86-0-lts #1-Alpine
[ 29.373666] EIP: test_nmi_ipi.constprop.0+0x99/0xe0
[ 29.373666] Code: ed 87 ff 83 f8 40 75 09 89 d8 35 8e 02 f6 0e eb 1b 8d 83 1b be 94 ec 4e 74 12 b8 c7 10 00 00 81 eb 54 1a 54 50 e8 0c cb 8e ff <eb> c5 35 09 99 49 6c ba 9d 43 96 c6 89 c3 31 c0 e8 17 20 4f ff 8d
[ 29.373666] EAX: 00000000 EBX: 83179bd0 ECX: 00000000 EDX: 00000000
[ 29.373666] ESI: 000ed549 EDI: c6b90c04 EBP: c10f3f2c ESP: c10f3f24
[ 29.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[ 29.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[ 29.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 29.373666] DR6: 00000000 DR7: 00000000
[ 29.373666] Call Trace:
[ 29.373666] local_ipi+0x32/0x4f
[ 29.373666] dotest.constprop.0+0x11/0xbb
[ 29.373666] nmi_selftest+0x83/0x1a3
[ 29.373666] native_smp_cpus_done+0x2a/0xc7
[ 29.373666] smp_init+0x7b/0x94
[ 29.373666] kernel_init_freeable+0x13f/0x28e
[ 29.373666] ? rest_init+0xb0/0xb0
[ 29.373666] kernel_init+0x17/0xf0
[ 29.373666] ret_from_fork+0x1c/0x30
The test is supposed to timeout, but occasionally the watchdog shows that it's hitting udelay:
[ 113.373666] EIP: delay_tsc+0x41/0xa0
[ 113.373666] Code: 31 bf c6 89 45 f0 0f ae e8 0f 31 89 45 e0 89 55 e4 eb 17 8d b6 00 00 00 00 f3 90 64 8b 1d 3c 31 bf c6 39 5d f0 75 32 89 5d f0 <0f> ae e8 0f 31 8b 4d e8 8b 5d ec 89 45 d8 89 55 dc 2b 45 e0 1b 55
[ 113.373666] EAX: ffffffff EBX: 00000000 ECX: 000003e9 EDX: 00000000
[ 113.373666] ESI: 000d8da8 EDI: c6b90c04 EBP: c10f3f14 ESP: c10f3eec
[ 113.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[ 113.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[ 113.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 113.373666] DR6: 00000000 DR7: 00000000
[ 113.373666] Call Trace:
[ 113.373666] __const_udelay+0x31/0x40
[ 113.373666] test_nmi_ipi.constprop.0+0x99/0xe0
[ 113.373666] local_ipi+0x32/0x4f
[ 113.373666] dotest.constprop.0+0x11/0xbb
[ 113.373666] nmi_selftest+0x83/0x1a3
In v86, IOAPIC_DELIVERY_NMI is stubbed:
https://github.com/copy/v86/blob/17a6b3b4e942ed04eb22098023d4b694f97020c0/src/apic.js#L436-L440
So it seems like a layered issue:
- Kernel probably shouldn't be calling nmi_selftest when SMP is disabled
- The test timeout isn't working, for some reason
IOAPIC_DELIVERY_NMIshould probably be implemented (at least trivially)
The first version of Alpine Linux (3.0.0) works fine in copy.sh/v86

It's any kernel built with CONFIG_DEBUG_NMI_SELFTEST=y. Nothing to do with Alpine Linux in particular, other than they added that kernel config sometime within the last few years.
Just out of curiosity can you share the latest version of alpine compiled with CONFIG_DEBUG_NMI_SELFTEST=n
As far as I can tell the last version that currently works is 3.5.3 which is super old (slightly newer versions seem to have a different problem, then all versions start to have this issue). How much work will it be to implement this so versions of Alpine Linux from within the last 5 to 6 years can run in v86? I think the alternative is compiling a custom kernel while running Alpine in QEMU and that seems like a lengthy process that requires a very large image size. It would be really nice to have a useful OS to run in v86. @copy has suggested using Alpine, but how are we supposed to do that?
The Arch Linux profile on copy.sh/v86 is usable to some extent
From: edwillard @.> Sent: Monday, February 12, 2024 5:02 PM To: copy/v86 @.> Cc: Ryan @.>; Comment @.> Subject: Re: [copy/v86] Kernel hang with CONFIG_DEBUG_NMI_SELFTEST (Alpine Linux) (Issue #797)
As far as I can tell the last version that currently works is 3.5.3 which is super old (slightly newer versions seem to have a different problem, then all versions start to have this issue). How much work will it be to implement this so versions of Alpine Linux from within the last 5 to 6 years can run in v86? I think the alternative is compiling a custom kernel while running Alpine in QEMU and that seems like a lengthy process that requires a very large image size. It would be really nice to have a useful OS to run in v86. @copyhttps://github.com/copy has suggested using Alpinehttps://github.com/copy/v86/issues/972, but how are we supposed to do that?
— Reply to this email directly, view it on GitHubhttps://github.com/copy/v86/issues/797#issuecomment-1938990689, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AQQXXR6PAFNWIGTCGFXKUH3YTI4LFAVCNFSM6AAAAAAUGOQEV6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMZYHE4TANRYHE. You are receiving this because you commented.Message ID: @.***>
@spetterman66, thank you for your suggestion. Arch 32-bit works well enough but it lacks current versions of most popular developer tooling packages (at least that I checked) like Node and NPM. Then it runs into various issues when trying to install popular NPM-based tooling like CRA.
I tried creating a custom image of Arch with QEMU (so I could try building Node from source) but got stuck at the keyboard support part.
I think v86 is fast enough that it could be used for some pretty cool things but not without a useful OS (for a developer). Sorry if this is getting off topic but I am more than willing to help out in the effort. I just need to know where to best start.
@edwillard In what sense does this bug prevent you from using Alpine on v86? From what I can tell, it doesn't cause any problems besides a 5 second delay in the boot process. Besides that, Alpine seems to work fine.
When I create an image using the latest Alpine Virtual 3.19.1 it outputs a message similar to "watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]" from the OP over and over with the seconds number growing each time and never gets to anything usable. Maybe I missed something?
@spetterman66 also seemed to confirm that only old versions work in an earlier message...
To be thorough I also tried Alpine Standard 3.19.1 and it has the same issue. I can successfully use Alpine versions 3.5.3 and older, quickly and without any problem. Current versions never get past the soft lockup messages even if left to run for half an hour.
Are you using http://copy.sh? Try https://copy.sh or localhost. Timer resolution is reduced on non-https (see https://developer.mozilla.org/en-US/docs/Web/API/Performance/now#security_requirements)
I use localhost but have tried https://copy.sh/ (never http) and it has the same issue in both cases. Although your point about timer resolution made me consider that it could be a Firefox (included by default on my fresh install of Ubuntu) issue or something related to Firefox settings so I installed Chrome and it does indeed work in Chrome. It would be nice to get it working in Firefox too if anyone has any ideas. I can provide an update here or open a new issue if I find anything relevant. Thank you very much for your support and excellent work on v86.
Edit: Just to note that Alpine versions 3.5.3 and below do work with Firefox regardless.
To be thorough I tried on localhost with a fresh install of Firefox Developer Edition and it has the same issue. It seems like it must be something from the combination of this bug and Firefox. I would like to test on Safari as well when I have a chance and will update on the results if no one else gets to that sooner.
I was able to get on a Mac sooner than expected and it works in Safari and Chrome but not in Firefox. Not sure if I can get Windows set up and test there too but my guess would be that Chrome works but not Firefox. Happy to help with Firefox however I can just let me know.
You can run this code to determine the resolution of performance.now(): x = new Set(); while(x.size < 5) x.add(performance.now()); Array.from(x).sort((a,b) => a-b).slice(-2).reduce((a,b) => b-a)
On Firefex (developer edition), I get 1ms on http sites and 0.02ms on https (with COOP/COEP).
The NMI test suite calls udelay 1000000 times: https://github.com/torvalds/linux/blob/28b8235/arch/x86/kernel/nmi_selftest.c#L82-L83
udelay on x86 is here: https://github.com/torvalds/linux/blob/28b8235/arch/x86/lib/delay.c#L207
udelay will repeatedly call rdtsc until it changes (some amount that should amount to 1us), but with the reduced resolution this will take (worst case) 16 minutes (1ms * 1000000) on Firefox on http and 20 seconds on https (0.02ms * 1000000). I can experimentally confirm the 20 second delay.
On Chromium, the resolution is higher (0.1ms and 0.005ms respectively). Maybe your Firefox has a different resolution or doesn't accept the COOP/COEP headers on https://copy.sh?
Now regarding fixes, there are a couple of options:
- Implement NMIs in the apic (so the test passes instead of timing out). Currently apic support is experimental and doesn't boot Alpine at all. In any case, it would be nice if the timer issue could be fixed properly, as other OSes might also be affected (I believe KolibriOS has a similar loop in its boot process).
- Do nothing and wait until Firefox gets a better timer resolution
- Hack rdtsc to return higher values if called multiple times and the browser time hasn't changed. There is some prototype that was enabled a long time ago (see https://github.com/copy/v86/blob/bfb59f9/src/rust/cpu/cpu.rs#L3972). However, this would mess with other uses of rdtsc. Maybe behind a flag.
- Implement tpause (although it's not entirely clear how, if the timer resolution sucks)
- Change cpuid flags so that Linux stops using the tsc and uses the default delay_loop instead. In my experiments, setting
cpuid_levelto 0x14 makes Linux try to calibrate the tsc against the pic and fail ("Marking TSC unstable due to could not calculate TSC khz"), at least on Firefox where the timer resolution is 1ms - Compile a kernel without the nmi selftest
Or, since this is not really v86's fault:
- Ask Firefox developers to improve their timer resolution (do two browsers really need four different values?)
- Send a patch to Alpine to disable the NMI selftest
- Send a patch to Linux to only run the NMI selftest on systems that have an apic
@copy, thank you so much for your detailed investigation, this is great and very helpful.
I found that running the provided code to determine the timer resolution in my standard Firefox resulted in almost 17ms regardless of localhost/HTTP/HTTPS! Based on your example that might be expected to take 4 to 5 hours instead of 16 to 17 minutes!
I thought the almost 17ms value suspiciously matches up nearly perfectly with the "16.67ms frame budget" so it has probably been tuned with that in mind.
After finding that Firefox Developer Edition behaves as you stated (1ms and 0.02ms) I figured out that setting the config value (by going to about:config) for privacy.resistFingerprinting to true (default false) was responsible for the ~16.67ms value (note that the browser needs to be restarted after changing this for the timer resolution to be changed).
With privacy.resistFingerprinting set to false I see a timer resolution of around 0.02ms on https://copy.sh/v86/ but other HTTPS sites and localhost are still 1ms which I suppose is due to the lack of COOP/COEP headers but have not looked into that deeper.
I did find that setting the config value for privacy.reduceTimerPrecision to false (default true) makes everything (localhost/HTTP/HTTPS) 0.02ms and also that setting the config value for privacy.reduceTimerPrecision.unconditional to false (default true) in addition to privacy.reduceTimerPrecision makes everything below 0.001ms, often below 0.0005ms (note that the browser does not need to be restarted after changing these for the timer resolution to be changed).
Non-default config settings are difficult to require of users so I think the quickest workaround to the issue for Firefox support is to either wait about 17 minutes for Alpine (or any other distro that has the same problem) to finish booting (assuming privacy.resistFingerprinting set to false), or change the config settings in Firefox, or just load it in Chrome, then save the state and simply load that which nicely bypasses the issue in Firefox (even if the user has privacy.resistFingerprinting set to true).
I plan on digging into everything you wrote more (if only for my own education) and will update with any additional results. Thanks again for your help, I really appreciate it. Hopefully this helps any other Firefox users that are having the same trouble.
Edit: Just to note that I think the reasoning for the timer resolution decisions in Firefox has to do with Spectre and Meltdown mitigations so it might be undesirable to set privacy.reduceTimerPrecision and privacy.reduceTimerPrecision.unconditional to false in general unless someone more knowledgeable says otherwise.
I found that the timer resolution can have a large impact on overall performance. Taking the example of running npm run build: a resolution of ~0.0005ms or 0.02ms seems to perform equivalently, while a resolution of 1ms takes about 2/3 longer for the command to finish, and a resolution of ~16.67ms makes running the command impractical. I would like to learn more about why this is the case. Any ideas (CPU or IO)? Whatever the specifics, it would be advisable for anyone using v86 in production to include a timer resolution test and alert the user of degraded performance if the resolution is above 1ms.
It's used roughly three places:
- rdtsc returns it directly (multiplied by
TSC_RATE): https://github.com/copy/v86/blob/21cf9ad/src/rust/cpu/cpu.rs#L3966 - main_loop uses it to yield to the browser (for redrawing and receiving mouse/keyboard/fetch events) every 1ms: https://github.com/copy/v86/blob/21cf9ad/src/rust/cpu/cpu.rs#L3056 This could explain why 16ms is much slower, but not why 1ms is slower. We could switch to yielding after a fixed number of cycles (at least as an option). Note that every yield has a fair overhead (there's no good browser api, see https://github.com/copy/v86/blob/21cf9ad/src/main.js#L98) and cycles per second aren't stable
- To run the pic and rtc. This condition is probably never going to return true on 16ms (and possibly on 1ms), but I don't think it's used by Linux. The pit typically runs at 10ms (configurable by the guest), so that doesn't really explain any performance differences. The logic looks fine to me, even for 16ms.
setTimeout might also be affected. v86 uses setTimeout instead of postMessage when the cpu is executing HLT and there are no timer interrupts coming up. If setTimeout is throttled to 16ms, it may sleep much longer than intended.
To narrow it down, I'd suggest:
- Replacing
microtickwith a throttled version (function throttled_microtick() { return 16.67 * Math.round(microtick() / 16.67); }) in selected places - Disabling idling by commenting out
setTimeouthere
- Compile a kernel without the nmi selftest
I recompiled Alpine Linux edge kernel without CONFIG_DEBUG_NMI_SELFTEST (virt: iso, package, lts: package), and it's worked in Firefox 122.0.1 (Windows) on non-https copy.sh/v86:
Notes:
- To install kernel to already installed system:
apk add --allow-untrusted ./linux-<virt/lts>-6.6.16-r0.apk - Networking (ne2k-pci) working only on linux-lts, because linux-virt don't have drivers
- Instruction for building custom kernel: https://wiki.alpinelinux.org/wiki/Custom_Kernel (compiles lts and virt once together, needed installed Alpine Linux and 20GB+ storage)
@edwillard I don't have a fix for the nmi selftest yet, but I pushed a fix for slow disk IO into the wip branch (dfacd61). Please test and let me know if this fixes (some of) your performance issues.
I pushed a fix for the nmi selftest (especially with privacy.resistFingerprinting) to the wip branch. It might also help performance in some other cases where programs sleep waiting for (small) rdtsc changes.
I pushed a better fix in https://github.com/copy/v86/commit/e644f89d7a5f818037a49f9597ca5bd2e0d0c369 (in the wip branch)
- Compile a kernel without the nmi selftest
I recompiled Alpine Linux edge kernel without
CONFIG_DEBUG_NMI_SELFTEST(virt: iso, package, lts: package), and it's worked in Firefox 122.0.1 (Windows) on non-https copy.sh/v86:
Notes:
* To install kernel to already installed system: `apk add --allow-untrusted ./linux-<virt/lts>-6.6.16-r0.apk` * Networking (ne2k-pci) and sound (sb16) working only on **linux-lts**, because **linux-virt** don't have drivers * Instruction for building custom kernel: https://wiki.alpinelinux.org/wiki/Custom_Kernel (compiles lts and virt once together, needed installed Alpine Linux and 20GB+ storage)
Did you enable SB16 support when compiling this? My installation of Alpine doesn't seem to detect it so just wondering if installing this kernel would help with that
Can you try apk add linux-firmware-sb16 and https://github.com/copy/v86/issues/616#issuecomment-1035658629?
I don’t seem to have a module called snd-sb16. I have snd-sb-common and snd-sb16-dsp, but no snd-sb16. Installing linux-firmware-sb16 doesn’t add the missing module, either.
Sorry, I have confused with sb16 and sb16-dsp
I have no luck with compiling just snd-sb16 module without whole kernel build (I tried make M=sound/isa/sb)
That’s okay. I’ve created an issue on the GitLab page for Alpine requesting for the CONFIG_SND_SB16 flag to be enabled on linux-lts for x86 Alpine Linux v3.20, which would include the module in all future linux-lts package releases. As long as the issue is deemed worthy of a fix by the developers, sound on Alpine might soon be fixed for everyone :)
Great news! linux-virt should now have the necessary network and sound drivers: https://gitlab.alpinelinux.org/alpine/aports/-/commit/af54cf99c3ace522d51b0aac1681e6edfbb56098
