Symmetric multi-processing (SMP) support
Multi-CPU support using platform-appropriate primitives for parallel computing, such as web workers in the browser or worker threads in Node.
That would require a powerful setup
On the contrary, it should be feasible to emulate a multi-core CPU with SharedArrayBuffer, web workers/threads, and isolating per-CPU state like the instruction pointer. I'm not sure yet how that would look with the JIT: whether multiple emulated CPUs could share a single instance, or whether each CPU needs its own.
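To make the idea a bit more concrete, here is a minimal sketch of that split: guest RAM lives in one SharedArrayBuffer, while state such as the instruction pointer stays private to each emulated CPU. All names (`VirtualCpu`, `lockedIncrement`, the memory size) are hypothetical and not v86 APIs. To keep the example self-contained, both CPUs run on one thread here; in a real setup each would presumably live in its own worker and receive the same buffer via `postMessage`.

```typescript
// Hypothetical layout; none of these names exist in v86.
const GUEST_MEMORY_BYTES = 64 * 1024;

// Guest RAM is shared: every emulated CPU sees the same bytes.
const sharedRam = new SharedArrayBuffer(GUEST_MEMORY_BYTES);

class VirtualCpu {
    // Per-CPU state is private to this CPU (and, in a real setup, to its worker).
    eip = 0;
    readonly id: number;
    private readonly ram: Int32Array;

    constructor(id: number, buffer: SharedArrayBuffer) {
        this.id = id;
        this.ram = new Int32Array(buffer);
    }

    // Models something like `lock inc dword [addr]` as an atomic read-modify-write.
    lockedIncrement(byteAddress: number): number {
        this.eip += 4; // pretend the instruction is 4 bytes long
        return Atomics.add(this.ram, byteAddress >>> 2, 1) + 1;
    }

    // Atomic read of a dword from shared guest RAM.
    load(byteAddress: number): number {
        return Atomics.load(this.ram, byteAddress >>> 2);
    }
}

const cpu0 = new VirtualCpu(0, sharedRam);
const cpu1 = new VirtualCpu(1, sharedRam);
cpu0.lockedIncrement(0x100);
cpu1.lockedIncrement(0x100);
cpu1.lockedIncrement(0x100);
```

Both CPUs observe the same counter in shared memory, while their `eip` values diverge. The sketch says nothing about the harder question raised above, namely whether JIT-compiled code could be shared between CPUs.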
I'm not opposed to having SMP (especially in the form of a user-contributed PR), and it has been shown to be feasible by jor1k and others.
That said, I'd like to improve single-core performance first, since v86 is still fairly slow in that regard (compared to e.g. qemu-tcg).
@copy I'm quite interested in improving performance, do you have any insight as to where to start?
Generally, find some OS and/or program you'd like to make run faster and profile it. https://profiler.firefox.com/docs/ works as a sampling profiler, and `make debug-with-profiler` produces stats at the bottom of debug.html (but runs much slower).
You can also inspect the wasm (and the x86 asm) that v86 generates in dev tools or by changing this flag: https://github.com/copy/v86/blob/35f1f83/src/cpu.js#L1254
Some concrete ideas:
- Picking up https://github.com/copy/v86/pull/466. Generally, extending instruction analysis to record which instructions read or write registers and flags, which enables several optimisations like omitting flag generation (as in the aforementioned PR) and optimising `cmp+jmp` more often. Needs better fuzzing, which I'm working on.
- Detecting function boundaries would enable improving control flow. At the moment all function entries and labels (switch targets/goto) are in a single big `br_table`, which wasm jits aren't too happy about. This would also allow optimising call/ret into wasm calls (with some checks at runtime that the eip on the stack matches the expected return value; and a callstack limit).
- Optimising vga memory writes using the existing memory fast path (e.g., allow using the fast path if the page is already marked dirty). See https://github.com/copy/v86/issues/301#issuecomment-1703971315
- Running cpu emulation in a web worker (#534, #219)
- Port pic+apic to Rust. Allows running more of the main loop in Rust (fewer switches between Rust/JS), and possibly some optimisations around mmio.
- Optimise access to stack memory. If the size of stack access (ESP) can be determined statically, then ESP only needs to be translated to physical once. Needs a fallback to slow path if the access crosses a page.
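
As a rough illustration of the last idea (stack access), here is a sketch under stated assumptions: the lowest and highest byte offsets a block touches relative to ESP are known at code-generation time, and `translate` is a hypothetical stand-in for the page-table walk. None of these names come from v86. If the whole window fits in one page, a single translation yields a physical "ESP base" that every access in the block can reuse; otherwise the function returns `null` and the caller falls back to the slow path.

```typescript
const PAGE_SHIFT = 12; // 4 KiB pages

// Hypothetical stand-in for the emulator's virtual -> physical translation.
type TranslateFn = (virtualAddress: number) => number;

// minOffset / maxOffset: lowest and highest byte offsets (relative to ESP at
// block entry) touched by the block, assumed to be statically known.
function translateStackWindow(
    esp: number,
    minOffset: number,
    maxOffset: number,
    translate: TranslateFn,
): number | null {
    const low = (esp + minOffset) >>> 0;
    const high = (esp + maxOffset) >>> 0;
    if ((low >>> PAGE_SHIFT) !== (high >>> PAGE_SHIFT)) {
        return null; // window crosses a page boundary: use the slow path
    }
    // One translation covers the whole window; express it as a physical ESP base.
    return translate(low) - minOffset;
}

// Toy translation that maps every virtual page to physical page 0x5000
// and counts how often it is called.
let translations = 0;
const toyTranslate: TranslateFn = (va) => {
    translations++;
    return 0x5000 + (va & 0xfff);
};

// A block that pushes two dwords touches bytes ESP-8 .. ESP-1.
const fastBase = translateStackWindow(0x10000f00, -8, -1, toyTranslate);
const crossing = translateStackWindow(0x10001004, -8, -1, toyTranslate);
```

With a non-null base, an access at `ESP - 4` becomes `fastBase - 4` with no further translation. The second call straddles a page boundary, so it returns `null` without translating anything.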