Add support for AssemblyScript
Hey Uri, we have already spoken about possibilities to speed up the simulation. I am interested in adding support for AssemblyScript. I have followed the recent discussions and speed-ups. The question is: do you still think it can bring some performance gains, even after the enhancements that were made?
That's a good question!
One challenge with AssemblyScript is that the code can no longer be easily extended or modified from the JavaScript realm, and communication between JS and the AssemblyScript code can be quite costly, so we'll want to keep it to a minimum. For starters, I'd suggest creating a version of the demo project that works with AssemblyScript, so we can compare the performance.
If AssemblyScript does bring a notable performance improvement that can't be achieved by tuning the TS code, we can probably find a way to include an AssemblyScript binary in the releases, either in this repo or a dedicated repo. Once we have more data about the performance we can consider the best course of action.
This sounds like a good plan. I will take a look at it!
Thanks!
Another idea that I had would be to come up with an AVR → WebAssembly compiler, that is, to convert the raw AVR binary into WebAssembly code that does the same, so we don't have to pay the overhead of decoding each instruction as the program is executing, and perhaps the JIT will be able to do a better job at optimizing the generated code.
Oh wow. This sounds really interesting! The question would be, how to integrate the peripherals.
That's a good question. I'd imagine having a bitmap or similar that indicates which memory addresses are mapped to peripherals. Whenever you update a memory address, you'd check the bitmap. If it has a peripheral mapped to it, then you'd call a WebAssembly function that resembles the writeData() function that we currently have...
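To make the idea concrete, here is a minimal sketch of that bitmap scheme in TypeScript. Everything here is hypothetical - `peripheralMap`, `registerPeripheral` and the memory layout are made-up names for illustration, not the simulator's actual API (only `writeData` echoes the existing function mentioned above):

```typescript
// Hypothetical sketch of the peripheral bitmap idea; names and layout are
// illustrative, not the actual simulator API.
const MEM_SIZE = 0x900; // e.g. ATmega328p data space

const data = new Uint8Array(MEM_SIZE);          // simulated data memory
const peripheralMap = new Uint8Array(MEM_SIZE); // 1 = peripheral mapped here

type PeripheralHandler = (addr: number, value: number) => void;
const handlers = new Map<number, PeripheralHandler>();

function registerPeripheral(addr: number, handler: PeripheralHandler) {
  peripheralMap[addr] = 1;
  handlers.set(addr, handler);
}

// Called by the core on every memory write (this would be the function the
// WebAssembly code calls out to, resembling the current writeData()):
function writeData(addr: number, value: number) {
  data[addr] = value;
  if (peripheralMap[addr]) {
    // Bridge out to the peripheral implementation, possibly in JS land.
    handlers.get(addr)!(addr, value);
  }
}

// Example: a fake PORTB peripheral at data address 0x25.
registerPeripheral(0x25, (_addr, value) => {
  console.log(`PORTB <- 0b${value.toString(2).padStart(8, '0')}`);
});
writeData(0x25, 0b00100000);
```

The nice property is that the common case (no peripheral at the address) costs only a single array lookup, so the expensive JS↔WASM crossing would happen only for mapped addresses.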
That should work. Will the peripherals be in WebAssembly too? In that case, everything would run in WebAssembly, except the visuals, which have to stay in JavaScript. But then we again have the problem that all peripherals have to be converted to WebAssembly :/.
We could also mix-and-match. I believe stuff like the timers, which have to run constantly (in some cases after every CPU instruction or so), will have to live in WebAssembly land. But some other, less frequently used peripherals could possibly be bridged over to JavaScript land.
There's definitely much to explore here...
The thing is, it always depends on the workload. So yes, it is really interesting, but we need to start exploring at some point. Maybe we will reach a point where we can say we are already faster than we ever expected. In general, the CPU doing the simulating will always be a lot faster than the simulated MCU; the gain from optimization mainly matters for slow end devices.
So, should we create a plan for where to start with testing and exploring the possibilities of each approach?
Yes, and as you say, it's good to have some baseline to compare to.
Right now, the JS simulation has some things that can already be improved (e.g. the lookup table for instructions), and it runs pretty okay on modern hardware - achieving between 50% speed on mid-range mobile phones and 160%+ speed on higher-end laptops.
However, lower-end devices (such as the Raspberry Pi) only achieve a simulation speed of 5% to 10%.
So there is definitely room for improvement, especially if we consider the use case of simulating more than one Arduino board at the same time (e.g. two boards communicating with each other).
Ok, yes. For playgrounds etc. it would be really fun and interesting to have multiple boards running at once. So if this use case would otherwise be locked to higher-end devices, we need to improve it. I also had some interesting ideas about simulating such boards with Node.js and other JavaScript runtimes.
I am actually not familiar with the benchmark code. Can the benchmark run under all of the mentioned approaches, or do we need a new benchmark to make a meaningful comparison?
The current benchmark is pretty minimal - it runs a single-instruction program many, many times to compare different approaches for decoding.
I think a better benchmark would need:
- A larger program that runs for a specific number of CPU cycles and includes a variety of instructions, branches, etc.
- A more extensive benchmark program that uses peripherals in addition to just running code, so we can check the integration of everything together.
- If we want to run it in Web Assembly, we'd also need a runner in Web Assembly, though that shouldn't be too complicated.
I'd probably start with just the 1st, simpler benchmark, to get a feeling for whether the direction seems promising, and if it is, then we can devise a more extensive benchmark that will allow us to do a comprehensive comparison.
What do you think?
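For what it's worth, the 1st benchmark could be as simple as running a core for a fixed number of simulated cycles and timing it. A rough sketch, where the `Core` interface is a placeholder that each implementation (TS, AssemblyScript, C, Rust) would be adapted to, not the real API:

```typescript
// Rough benchmark harness sketch; `Core` is a hypothetical adapter interface.
interface Core {
  cycles: number; // simulated CPU cycles executed so far
  step(): void;   // execute one instruction, advancing `cycles`
}

// Returns the simulation speed relative to the real MCU (1.0 = real time).
function benchmark(core: Core, targetCycles: number, clockHz = 16_000_000): number {
  const start = Date.now();
  while (core.cycles < targetCycles) {
    core.step();
  }
  const elapsedSec = (Date.now() - start) / 1000;
  const simulatedSec = targetCycles / clockHz; // wall time on the real MCU
  return simulatedSec / elapsedSec;
}
```

Running the same harness with an identical test program against each implementation would then give directly comparable numbers.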
Starting with the 1st should be the best approach. Should I start by looking at AssemblyScript, or at WebAssembly directly? With WebAssembly directly, we also have to decide between AVR instruction translation and full interpretation (like in JavaScript) in WebAssembly.
Maybe we can focus on the most needed assembler instructions to reduce the initial instruction count and to focus the benchmark on those. That way we can see first results faster and decide afterwards where to dig deeper, or whether we can already see a clear winner?
I believe that Web Assembly interpretation (written in C or Rust) wouldn't be much different from AssemblyScript, but it's pretty easy to write one or two instructions, as you suggest, and compare the generated WAT (Web Assembly Text) between the different implementations. Ideally, if there is no significant difference, using AssemblyScript means we can probably keep a single code base, which is preferable.
Here are some useful resources:
- Web Assembly Studio - a great environment for experimenting with different languages for WASM (C, Rust, AssemblyScript). If I'm not wrong, you can also see the generated code (so it can also be helpful with the AVR → WASM compiler).
- Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code - according to this article, I'd expect WASM to run at about 60-70% of native code speed. So if, for instance, we simulate a single AVR instruction with 30 WASM instructions, that'd be roughly the equivalent of 50 host CPU instructions, or a x50 slowdown (so about 20 MIPS for every 1000 MIPS of host CPU).
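That last estimate can be written out explicitly; these are the assumed figures from the article and the message above, not measurements:

```typescript
// Back-of-envelope slowdown estimate using the assumed numbers above.
const wasmInstrPerAvrInstr = 30; // WASM instructions per simulated AVR instruction
const wasmVsNativeSpeed = 0.6;   // WASM at ~60% of native speed

// Effective host CPU instructions spent per simulated AVR instruction:
const hostInstrPerAvrInstr = wasmInstrPerAvrInstr / wasmVsNativeSpeed; // ≈ 50

const hostMips = 1000;
const simulatedMips = hostMips / hostInstrPerAvrInstr; // ≈ 20 MIPS
console.log(`~${Math.round(simulatedMips)} simulated MIPS per ${hostMips} host MIPS`);
```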
@gfeun (author of #19), you may find this discussion interesting :)
Hey, I've been summoned. Yeah, I'm following along :smile:
Yes, I agree. If there is no real breaking point that requires translating the TS code, we should stay with TS and use the AssemblyScript compiler.
-
The Web Assembly Studio looks great. I remember finding it before in my research. If you can wait, I will try to create a first comparison between the Rust, C and AssemblyScript approaches in the next 2 days. These languages are not my preferred ones, so I have to get familiar with the tooling 😅. I will try to create some sharable Web Assembly Studio workspaces and post them here. (And yes, the generated wasm code is visible.)
-
If we get to the point of translating the AVR binaries to WebAssembly, we can get below the 30 WASM instructions per AVR instruction. I think this project has some really interesting possibilities to get the best out of it. If we can get a good combination of all these possibilities, we get the opportunity to support a really wide range of end devices and possible runtimes! Also, 1000 MIPS is no problem for a modern CPU. Even an RPi 4 could maybe run at something around 70-100% or more (measured on some synthetic benchmark).
And hello @gfeun. Nice to meet you!
Yes, definitely, sounds like a good plan. It's going to be interesting :)
Hey, I've created 3 WebAssembly Studio (WAS) workspaces for C, Rust and AssemblyScript: C - https://webassembly.studio/?f=u627pgs1r5q Rust - https://webassembly.studio/?f=h80i9yrgjxa AssemblyScript - https://webassembly.studio/?f=j5dlh1kn5s
The code and folder structure are based on the empty template workspaces for each language. I've tried to bring them all together in one workspace, but that is more complicated.
For a start, I have implemented the same basic functions in each of them, with an identical file structure. All wat files (WAS transparently converts them to and from wasm) look basically the same, except for some different orderings and the funny detail that Rust and C swap the operands of the add function.
Surprisingly, the AssemblyScript version is the smallest, measured by line count. Rust and C have this additional line: (table $T0 1 1 anyfunc), which tells me nothing because I don't yet understand the file format. There is also a difference between Rust and AssemblyScript in the line (memory $memory (export "memory") 17), where AssemblyScript has 0 instead of 17.
C also has a few more lines, which are maybe only meta information that the others leave out.
I know the example functions don't leave much room for the implementations to differ. Can you take a look at them and tell me what you think? What would be a good example function to implement to find out something more meaningful?
Currently, my opinion is that even if AssemblyScript turns out to have a striking performance disadvantage at some point, we would always have the option to implement that specific feature in a different language.
(To see the wasm/wat code, run Build in the WebAssembly Studio)
For some reference, you can access https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format.
Oh, wow. I just realized that all the compilers have pre-evaluated the method calls in the main function 😅.
Yes, the compiler seems to be very good at optimizing - it inlines the functions and then also precalculates the result of the expression. Pretty smart!
So far, it seems like for basic arithmetic, we get roughly the same number of opcodes. What I'd try to do next is implement a complete opcode: e.g. define an array to hold the program data, and then implement an opcode that also reads and updates the data memory, such as ADD (the registers reside in the data memory, and it also updates the SREG register, so that would be a memory write).
Does that make sense?
I would say it makes sense 😄. I will try it first with the AssemblyScript version, doing some copy and paste of the original code. After that, I will port it to the other two.
So for correct understanding:
- Implement 'ADD' command
- Already implement basic opcode decoding?
- Small data memory to be used by 'ADD'
- Include the SREG register
Indeed, we'd need some basic opcode decoding to extract the target registers out of ADD. Also, the existing TS code can probably be used without too many changes...
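For reference, a sketch of what the ADD step might look like in AssemblyScript-flavoured TypeScript, following the AVR instruction set manual (register file at data[0..31], SREG at data address 0x5f, as on the ATmega328p). This is an illustration of the idea, not the project's actual code:

```typescript
// Simulated data memory: registers r0-r31 at 0..31, SREG at 0x5f.
const data = new Uint8Array(0x900);
const SREG = 0x5f;

// ADD Rd, Rr - opcode pattern 0000 11rd dddd rrrr
function add(opcode: number): void {
  const d = (opcode & 0x1f0) >> 4;                    // destination register
  const r = (opcode & 0xf) | ((opcode & 0x200) >> 5); // source register
  const a = data[d];
  const b = data[r];
  const sum = (a + b) & 0xff;
  data[d] = sum;

  let sreg = data[SREG] & 0xc0;                      // keep I and T flags
  if (sum === 0) sreg |= 0x02;                       // Z: zero
  if (sum & 0x80) sreg |= 0x04;                      // N: negative
  if ((a ^ sum) & (b ^ sum) & 0x80) sreg |= 0x08;    // V: signed overflow
  if (((sreg >> 2) ^ (sreg >> 3)) & 1) sreg |= 0x10; // S = N xor V
  if (a + b > 0xff) sreg |= 0x01;                    // C: carry
  if ((a & 0xf) + (b & 0xf) > 0xf) sreg |= 0x20;     // H: half carry
  data[SREG] = sreg;
}

// Example: ADD r0, r1 encodes as 0x0c01.
data[0] = 200;
data[1] = 100;
add(0x0c01);
// data[0] is now 44 (300 & 0xff) and the carry flag is set.
```

Porting this one function (decode included) to each of the three languages should already exercise memory reads/writes and bit manipulation, which is probably a fairly representative workload.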
Ok, I will do so. Should I create an additional repo/project for this testing code? I have also moved from WebAssembly Studio to IntelliJ (or another IDE).
Tonight I had another idea: would it be possible to add some async command prefetching to reduce the time spent on opcode decoding?
Yes, we can either create a different repo (if the code is entirely different), or a new branch here, in case the code is still the same (or auto-generated from the current code, like I did with the benchmark).
As for IDE, that makes sense. I use VSCode for this repo, so if you open it with VSCode you should get a list of recommended extensions (prettier, eslint, etc).
What do you mean by async command prefetching?
I currently mean it only for the testing things, i.e. the example AssemblyScript, Rust and C code. For the "real" implementation I would prefer a separate branch with the goal of bringing it to master.
I will try to get comfortable with VSCode for dev purposes ;D.
We recently discovered the problem with the big if-else statement. The idea would be to evaluate this statement for the next opcode asynchronously. But this requires bringing the code into a format where that is possible.
Yes, it makes sense to create a new repo for experiments. Would you prefer to have it under your github user or here (under the wokwi organization)?
Not sure how you'd go about evaluating the next opcode async, maybe I need to see an example to understand?
I think that because it is related to this project, it should stay under wokwi - also because I will copy some code, and maybe there will be some more experiments in the future.
Ok, I think I see the misunderstanding. I don't mean the full evaluation, only the "prefetching" of the next operation. But currently all operations sit under their specific if-statement; a refactor into their own methods would be required. Then we could evaluate the next command lookup asynchronously and, for example, save the resolved function in a variable, so the "running" code only needs to call that function instead of evaluating the big if-else block first. It would be something like a 2-stage command pipeline.
But this would only bring an improvement if the decoding stays the same. If the decoding is rebuilt with lookup tables or something else, this would bring only a (I think) small improvement.
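If I understand the idea correctly, a minimal single-threaded sketch of that 2-stage pipeline would look roughly like this. All names are hypothetical; in plain JS this is really decode caching rather than truly async work (which would need a worker):

```typescript
type Handler = (opcode: number) => void;

// Stand-ins for the per-opcode methods after the big if-else is refactored:
const trace: string[] = [];
const addHandler: Handler = () => trace.push('ADD');
const nopHandler: Handler = () => trace.push('NOP');

// Stand-in for the decode step (the former if-else chain):
function decode(opcode: number): Handler {
  if ((opcode & 0xfc00) === 0x0c00) return addHandler; // ADD Rd, Rr
  return nopHandler;
}

const progMem = new Uint16Array([0x0c01, 0x0c12, 0x0000]); // tiny fake program

function run() {
  let pc = 0;
  let next = decode(progMem[pc]); // stage 1: prefetch the first handler
  while (pc < progMem.length) {
    const handler = next;
    const opcode = progMem[pc];
    pc++;
    // Prefetch the next handler before executing the current instruction.
    // (A branch taken inside `handler` would invalidate this prefetch.)
    if (pc < progMem.length) next = decode(progMem[pc]);
    handler(opcode);
  }
}
run();
```

As noted above, once decoding becomes a cheap table lookup, the prefetch stage likely buys little, and taken branches would force re-decoding anyway.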