Native code compilation like the V8 engine

Open kuselanonline opened this issue 9 years ago • 35 comments

Is it possible to compile JS on Duktape to native code, like the V8 engine does? At least for the ARM platform; Duktape would then be a good alternative to Node.js for embedded systems.

kuselanonline avatar Jul 16 '15 05:07 kuselanonline

Do you mean this http://duktape.org/api.html#duk_compile?

olebedev avatar Jul 16 '15 06:07 olebedev

I think he means just-in-time (JIT) compilation, which Duktape doesn't have; it's 100% bytecode-interpreted.

fatcerberus avatar Jul 16 '15 06:07 fatcerberus

BTW, as far as I know, @svaarala hasn't excluded the possibility of having a JIT in Duktape in the near future. But that is a little bit inconsistent with a compact footprint.

olebedev avatar Jul 16 '15 06:07 olebedev

There's currently no support for compiling JS code to native code. I wouldn't exclude it from future work, but it'd need to be added in a modular manner so that portability and compactness aren't compromised by the change.

Note that just compiling bytecode into native code would give a limited performance improvement if your benchmark is V8. For example, property lookups in V8 use "hidden classes" internally, which are very fast but trade off complexity and memory usage. Without a similar change, property lookup code would still be order(s) of magnitude slower than in V8.

Making the JS compiler more capable is also a related need: currently the compiler works without an intermediate representation (IR), which is good for code and memory footprint but limits what the compiler can do. Rewriting the compiler to use an IR (which is required by some ES6 constructs, so it needs to be done at some point) would also make it possible to add optional optimization passes to compilation in a modular fashion. This would be nice for performance optimization in general.

svaarala avatar Jul 16 '15 16:07 svaarala

Thanks for the clarification! It would be a useful improvement for systems that are focused on performance instead of a compact footprint but still need portability. The high-performance V8 engine is not portable enough and its build is too complicated, so Duktape looks like a reasonable alternative.

olebedev avatar Jul 16 '15 17:07 olebedev

Does this fit into the near-term plans?

olebedev avatar Nov 17 '15 13:11 olebedev

Probably not in the near future, unfortunately. There's more important work (ES6 stuff for instance) and no volunteer with JIT experience to spend a lot of time figuring out a JIT approach which doesn't impact portability or low memory targets (with JIT turned off obviously). Adding a JIT which makes porting more difficult or compromises low memory targets is not very useful.

And like I've said before, plain JIT is not all that helpful without also improving compiler optimization capabilities in general (some opportunities will open up for this with ES6 work which requires a more stateful compiler), optimizing property accesses somehow (V8 gets much of its speed from hidden classes), inlining function calls, type inference, reference count operation collapsing, etc.

svaarala avatar Nov 17 '15 13:11 svaarala

A little clarification to the above: I don't mean all those optimizations would need to be implemented right away. But there needs to be an architecture which allows them to be implemented incrementally without compromising the portability/low memory goals or requiring extensive rewrites as features are added.

As a concrete example: a very simple JIT could be implemented by adding a post-step to the Ecmascript compiler which would take the bytecode from the compilation result and compile it into machine code (perhaps optimizing some operations like reference count updates a bit). There would be some immediate issues (line numbers and tracebacks, debugger integration, etc.). But ignoring these, one could then extend the JIT compiler piece by piece.

But I'm afraid this would be somewhat of a semi-dead-end because adding the features needed to be competitive with e.g. V8 requires fundamental changes elsewhere - like property handling, run-time type detection for specialization, deoptimization, etc.

Consider a function which currently spends 50% of its time in bytecode dispatch and other overhead, and 50% in property accesses. Even if the JIT eliminated all of the bytecode execution overhead (dropping it from 50% to 0%), the code would still only run twice as fast as before, and no further improvement would be possible without optimizing the property accesses. I might not be personally happy with JIT performance like this, for example.

svaarala avatar Nov 17 '15 13:11 svaarala

@indutny, would you have time / interest to see if JIT can be added without compromising the main goals? I seem to remember candor-lang was fairly simple.

creationix avatar Nov 17 '15 14:11 creationix

@svaarala what about a tracing jit like luajit instead of the inline-cache and hidden class style of V8? Is JS code just not suitable to this kind of optimization?

creationix avatar Nov 17 '15 14:11 creationix

Hmm, I don't see how a tracing JIT would handle e.g. property access optimization any better than a per-function JIT, considering that properties can change arbitrarily at any time? The JIT could maybe make some assumptions about properties or inheritance not changing, but verifying these assumptions at runtime or invalidating code whose assumptions are violated could be pretty complex (the functions may already be executing, etc). It's much easier if one doesn't care about compliance.

At a high level I'm sure there are several approaches that would work reasonably for a basic JIT (disregarding the other optimizations). I'd really like to prototype a few approaches to get a concrete feel of what works best in practice - it's often the fine details which pose surprising problems.

Keeping the interpreter and the JIT in the same code base probably means some compromises on the JIT part so that they can coexist without needing to fork or rewrite. If those compromises are too big, a separate project might make more sense, with the downside of being a bit confusing for users.

svaarala avatar Nov 17 '15 15:11 svaarala

Right, I don't think a luajit style jit would work well for the way most people tend to write JS code with tons of property lookups. When I'm writing luajit code that needs to be fast I avoid property lookups whenever possible and look at the generated bytecode to verify my compiler assumptions.

creationix avatar Nov 17 '15 15:11 creationix

@creationix are you talking about something like a full-codegen in V8, i.e. generate very simple assembly through one traverse of the AST?

indutny avatar Nov 17 '15 18:11 indutny

@indutny I'm not sure, just trying to think of a way JIT could help speed up duktape without completely blowing up the memory requirements and/or codebase complexity.

creationix avatar Nov 17 '15 18:11 creationix

The biggest obstacle right now, as I understand it, is that the Duktape compiler has no AST or any kind of intermediate representation at all. It's all recursive-descent with very little prior context available at any given step.

fatcerberus avatar Nov 17 '15 18:11 fatcerberus

Oh yeah, @fatcerberus , this is going to be a big obstacle, but it should be solvable.

indutny avatar Nov 17 '15 18:11 indutny

If one were to implement JIT right now it'd be easiest to base on converting bytecode into machine code. An AST would of course be preferable and that will become available with the ES6 work.

svaarala avatar Nov 17 '15 18:11 svaarala

@svaarala reading through the code.

indutny avatar Nov 17 '15 18:11 indutny

@indutny Here's an example of how the current code generation works: https://github.com/svaarala/duktape/blob/master/doc/compiler.rst#ivalue-example

In short, there's a tiny fragment of an intermediate representation, one step away from becoming a RHS or LHS value. The compiler combines and coerces these "ivalues" and emits bytecode on the side.

ES6 support will require at least a statement-level IR or (more likely) a function-level IR. Compilation is a big issue on low memory targets due to memory usage, and an IR will exacerbate that problem, so there's a challenge in how to do it with minimal RAM impact.

svaarala avatar Nov 17 '15 18:11 svaarala

What platforms should this JIT run on? ARM?

indutny avatar Nov 17 '15 18:11 indutny

From a code structure point of view, multiple targets would need to be supported (at a future time, of course). There are targets running x64, x86, ARM, PowerPC, SuperH, MIPS, Motorola 68k, etc. For most users I'd expect x64 and ARM to be most relevant right now, but that's just my view.

svaarala avatar Nov 17 '15 18:11 svaarala

x86, x64 and ARM would probably be a good starting point. x86 is still relevant since it's the preferred target for Windows applications that don't need the full 64-bit address space.

fatcerberus avatar Nov 17 '15 18:11 fatcerberus

Thank you for the information, I will look through the code and reply later today (or tomorrow).

indutny avatar Nov 17 '15 18:11 indutny

Keep in mind there may be some mobile ARM platforms (such as iOS) that won't benefit from this due to limitations/restrictions in the underlying platform. This is one of Duktape's biggest strengths right now: it's quite fast (and getting faster) unlike most other interpreted JS engines.

fatcerberus avatar Nov 17 '15 18:11 fatcerberus

Throwing out some ideas:

Inline Caching for binary operations

It is very useful to have operand type information at binary operation call sites. For example, + could technically be adding strings, numbers, or whatever. In most cases, a particular code site just works with monomorphic data. This is not that useful in interpreter mode, but is quite practical for a JIT.
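A minimal sketch of the idea (all names here are hypothetical, not Duktape internals): each + site caches the operand-type pair it last saw, and while the site stays monomorphic it takes a specialized fast path.

```c
#include <stdint.h>

/* Hypothetical monomorphic inline cache for a '+' call site. */
typedef enum { T_UNSEEN, T_INT, T_STR, T_OTHER } vtype;

typedef struct {
    vtype lhs, rhs;     /* cached operand types for this site */
    int   hits, misses; /* instrumentation for the sketch */
} add_site;

/* Stand-in for the generic slow path (string concat, coercion, etc.). */
static int64_t generic_add(int64_t a, int64_t b) { return a + b; }

static int64_t cached_add(add_site *site, vtype ta, int64_t a,
                          vtype tb, int64_t b) {
    if (site->lhs == ta && site->rhs == tb && ta == T_INT && tb == T_INT) {
        site->hits++;
        return a + b;          /* specialized int+int fast path */
    }
    site->misses++;
    site->lhs = ta;            /* re-specialize on the new type pair */
    site->rhs = tb;
    return generic_add(a, b);  /* fall back to the generic operation */
}
```

The first execution misses and fills the cache; subsequent int+int executions at the same site hit the fast path.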

Inline Caching based on hidden classes

This is not much more complicated than the first suggestion, but it introduces new heap objects: maps. There is a certain performance-to-memory-usage balance here, mostly controlled by the number of transitions we allow before giving up and deciding that the object is used just as a hashmap.

I think that ideally maps could be quite compact, and the objects using them could be made a bit more compact too. In other words, right now objects have a kind of fast-properties area, called the "entry part" in Duktape terminology. Property name, descriptor, and value are all stored there; with maps, the name and descriptor could be shared between many objects.

So I think after all it could be a slight win in memory usage. I highly recommend exploring this (and may try to do it myself).
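To make the sharing concrete, here is a hypothetical sketch (names invented for illustration): the property names live in a shared map ("hidden class"), and each object stores only the values at the slot indices the map dictates. Objects with the same layout share one map, which is where the memory win comes from.

```c
#include <stddef.h>
#include <string.h>

#define MAX_PROPS 8

/* Shared "hidden class": property names (and, in a real engine,
 * attributes/descriptors) live here, once per layout. */
typedef struct {
    const char *keys[MAX_PROPS];
    int count;
} shape;

/* Per-object storage: just a pointer to the shared map plus values. */
typedef struct {
    shape *map;
    double slots[MAX_PROPS];
} object;

static int shape_slot(const shape *s, const char *key) {
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->keys[i], key) == 0) return i;
    return -1;
}

static double get_prop(const object *o, const char *key) {
    /* In a JIT, this slot index would be cached at the call site,
     * turning the lookup into a single indexed load. */
    int idx = shape_slot(o->map, key);
    return idx >= 0 ? o->slots[idx] : 0.0;
}
```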

Direct Threading

This is an alternative to real JIT compilation that plays well with the bytecode we have right now. Instead of doing a loop over instructions, each instruction opcode could be the address of the function that should be executed. This way we can cut dispatch costs significantly; if I remember right, this is the way JavaScriptCore does it for its bytecode.
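A portable approximation of the idea (true direct threading stores raw code addresses; a function-pointer table like this sketch is sometimes called call threading; all names are hypothetical):

```c
#include <stddef.h>

/* "Bytecode" is an array of handler pointers; dispatch is one indirect
 * call per instruction instead of a central switch. */
typedef struct vm vm;
typedef void (*handler)(vm *);

struct vm {
    const handler *ip;  /* instruction pointer into the handler array */
    int acc;
    int running;
};

static void op_inc(vm *v)  { v->acc++;    v->ip++; }
static void op_dbl(vm *v)  { v->acc *= 2; v->ip++; }
static void op_halt(vm *v) { v->running = 0; }

static int run(vm *v) {
    v->running = 1;
    while (v->running)
        (*v->ip)(v);    /* dispatch */
    return v->acc;
}
```

A program `{op_inc, op_inc, op_dbl, op_halt}` starting from an accumulator of 0 yields 4. The memory cost Sami mentions below is visible here: every instruction slot is a full pointer rather than a compact opcode.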

Non-optimizing JIT

It may be reasonable to have both ways of running code simultaneously: JIT and bytecode interpreter. This implies that these functions should be able to call each other.

Generally JIT could be used in duktape in two ways: without register allocation, or with it.

Without Allocator

This is the kind-of-easy way: some fixed number of bytecode registers actually live in real CPU registers, and the others live in the stack frame (the function's activation). Each opcode is transformed into some generated code. Complex opcodes and edge cases (like adding a string and a number) should live in lazily-generated shared code stubs.

With Allocator

Not that easy here. It is very unlikely one could do a full-blown register allocator without constructing the SSA+CFG. Liveness analysis and moves need to be performed on blocks.

However, we could have some sort of limited register allocator that won't actually split live intervals, and will assign either a register or a stack slot for the whole lifetime of each virtual register. I think this is a very good idea in general, and it should be explored if the simplistic non-register-allocating JIT is implemented.

indutny avatar Nov 18 '15 04:11 indutny

Please let me know your thoughts on this, and which ideas you think could be worth exploring.

indutny avatar Nov 18 '15 04:11 indutny

Perhaps it's just my inexperience, but I've not yet been able to write a direct-threaded interpreter that is actually faster or uses less RAM (usually it's worse at both). Of course, I've been writing interpreters for a much simpler language than Duktape handles. If JSC was able to make it faster, maybe it can be done here too.

I'm really excited about the idea of a low transition limit for hidden classes. We could tell people who care about performance to only modify their objects' shape once, or something like that. Also, if it's feasible, the number of transitions could be tunable at compile time, since Duktape is used on a broad range of machines from ultra-low-memory to server-class hardware.

I'll defer to @svaarala for more meaningful feedback.

creationix avatar Nov 18 '15 05:11 creationix

Direct threading might enable incremental optimization because one could replace opcode sequences with target functions implementing the sequence in an optimized manner (even up to full traces maybe). But it's somewhat awkward as the baseline interpreter because it increases memory usage quite a bit for the bytecode, a significant issue for very low memory targets, so I don't see it could replace the current dispatch loop.

I tried direct threading a long time ago when working through dispatch alternatives, opcode formats, etc. It didn't look very good IIRC, and to be efficient with static compilation one probably needs something like gcc's label pointers (https://gcc.gnu.org/onlinedocs/gcc-3.0.1/gcc_5.html#SEC70) which are non-portable. But maybe there's some better way? For code generated on-the-fly there are no limitations like that but code generation is not a suitable portable baseline for non-JIT targets.
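For reference, the GCC "labels as values" approach mentioned above looks roughly like this sketch (hypothetical opcodes; non-portable, since computed goto is a GCC/Clang extension, which is exactly the portability problem being discussed):

```c
enum { OP_INC, OP_DBL, OP_HALT };

static int run(const unsigned char *code) {
    /* Table of label addresses, indexed by opcode; each handler jumps
     * directly to the next handler instead of returning to a switch. */
    static void *dispatch[] = { &&do_inc, &&do_dbl, &&do_halt };
    int acc = 0;
    const unsigned char *ip = code;
    goto *dispatch[*ip];            /* enter the first handler */
do_inc:  acc++;    goto *dispatch[*++ip];
do_dbl:  acc *= 2; goto *dispatch[*++ip];
do_halt: return acc;
}
```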

For the hidden classes etc: like I said above, there are quite a few viable approaches, but without prototyping it's quite difficult, at least for me, to say what works well. For example, hidden classes require quite a bit of "plumbing" here and there to work - how much of that will conflict with non-JIT usage, and will that play well for low memory targets where hidden classes might not be feasible (some targets run with 64kB or less allocated for the Javascript heap)?

As a general observation, it would be very nice if a single bytecode dispatcher could be used for both non-JIT and JIT cases (#ifdef'ing away the unnecessary JIT-related stuff when not using JIT). But this might not be achievable in practice because the goals may be quite different. Having two baseline dispatch loops (one for non-JIT, another acting as the JIT "slow path") is not ideal, but it's certainly not a show stopper.

svaarala avatar Nov 18 '15 19:11 svaarala

Regarding property accesses, there's obvious room to improve the hash and lookup behavior (planned for the 1.5.0 release), so some improvement is coming in the near term.

One relatively simple approach to improve property accesses for inherited properties (methods) would be to detect that a property read was inherited, and that the property can be safely cached to the object itself because (1) the property is non-writable and non-configurable, and (2) there are no extensible objects in the middle which might later capture the property. Caching differs a bit from copying because a cached property would only be used in property lookup, but wouldn't be listed as an "own property". By expanding the hash part size for frequently accessed objects, property lookup would approach a single hash lookup.

This is of course less efficient than hidden classes, but simpler and has less of a memory and architecture impact.

Unfortunately this would probably not work well because:

  • User code doesn't usually set up properties to be non-configurable and non-writable because it's not needed to work well with e.g. V8.
  • Even if they were, it's possible to change prototype chains on-the-fly using e.g. Object.setPrototypeOf() which would need to trigger invalidation of the cached properties from the inheriting objects for which we don't have a reference. Adding that book-keeping has a performance impact of its own.

It would be nice if property access optimizations worked without a JIT so that they'd benefit the exotic platforms too. Having the JIT rely on some memory hungry property optimization would be OK on the other hand, IMO.

svaarala avatar Nov 19 '15 10:11 svaarala

In the GNUstep Objective-C runtime, method lookup returns a structure that contains a version field. You can cache the structure and the version and, if they match, use the cache. Invalidation happens only when you override the method in a subclass (zero is a special 'do not cache' value: if a version overflows, it wraps to zero and we don't increment it any more). If you update the slot for a given class, its version isn't incremented: the new value is in the same structure and reusable. This does provide a speedup, but the cost of the atomic operations for thread-safe lookup offsets it somewhat (a problem that JavaScript does not have).
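A single-threaded sketch of that versioned-slot scheme (names hypothetical, not the GNUstep API): a call site caches the slot pointer and its version, revalidates by comparing versions, and refills on a mismatch.

```c
#include <stddef.h>

/* Result of a method lookup: implementation plus a version that is
 * bumped when an override should invalidate cached callers. */
typedef struct {
    void *impl;
    unsigned version;
} slot;

/* Per-call-site cache of the last slot seen. */
typedef struct {
    slot *cached;
    unsigned cached_version;
} call_site;

/* Stand-in for the full (slow) lookup through the class hierarchy. */
static void *lookup_slow(slot *s) { return s->impl; }

static void *cached_lookup(call_site *cs, slot *s) {
    if (cs->cached == s && cs->cached_version == s->version)
        return cs->cached->impl;   /* fast path: cache still valid */
    cs->cached = s;                /* slow path: refill the cache */
    cs->cached_version = s->version;
    return lookup_slow(s);
}
```

Updating `impl` in place without bumping `version` keeps existing caches valid, matching the "same structure and reusable" point above; an override bumps `version` and forces a refill.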

I prefer the architecture of JavaScriptCore to V8 and it's probably a better inspiration. For JIT compilation, I think that it's important to have some well-defined goals. JIT compilation is only useful for things that are both CPU-bound and long-running. I would actually be more interested in being able to AoT compile hot paths, where you can burn a load of cycles on a nice fast machine (after some profiling runs) and benefit from that speedup everywhere. I don't think that trying to make DukTape compete with V8 is interesting: they have very different targets.

For teaching JIT techniques, I use a toy language that is intended to capture most of the difficulties of compiling JavaScript. After working on this, my students are expected to do a miniproject on some real-world codebase. They've already chosen the ones for this year, but next year I'd be happy to point some of them in the direction of DukTape.

davidchisnall avatar Dec 03 '15 11:12 davidchisnall

AoT compilation would be quite interesting, I agree. One could either produce native functions out of that, or produce much better bytecode than the runtime compiler can achieve.

@davidchisnall Since you're working in an educational setting, it'd be interesting to hear what changes would make doing so easier. There's some conflict between trying to achieve a small size and keeping things easy to approach. But even so, much can be improved without compromising on those goals. Some design decisions early were not ideal, and some areas of the code could use a restructuring; this is an incremental process but any pointers in making Duktape more useful to student hacking would be much appreciated :)

svaarala avatar Dec 03 '15 11:12 svaarala

Aside from the somewhat convoluted build system, the DukTape code is pretty approachable and well documented - far more so than some of the other codebases that I've had students doing projects on (one of the advantages of being at Cambridge is that I generally have very competent students).

I think that you currently have a goal of keeping the bytecode private so that it can be easily changed. Longer term, having a more stable (and well documented) bytecode is probably helpful. In particular, I'd love to see DukTape provide a stable interface for attaching JITs / AoT code generators, without actually providing the JIT itself. This would make it easier to plug in different JITs for different uses and would mean that I'd be able to set up a skeleton project and have students hack on something. For JavaScript JITs, on-stack replacement ('deoptimisation') is also very important, so JITs (and AoT compilers) are going to need some mechanism for calling back into the interpreter when it doesn't make sense. The CoreCLR team that's been working on an LLVM-based JIT has found it very useful to have a JIT API that allows the JIT to say 'sorry, can't compile this' and fall back to the interpreter (or a more mature, but less fast JIT).

Better documentation of some of the internals would also be helpful. I've implemented a subset of the typed array and web worker specs using the public APIs while playing with DukTape, but it's often a bit hard to follow how builtin objects are implemented using the private APIs for comparison.

davidchisnall avatar Dec 03 '15 12:12 davidchisnall

Thanks, sounds pretty much what I was expecting, I'll do my best to advance Duktape in these areas :) The build system and testing framework will be upgraded next year, e.g. automatic regression coverage on more platforms is needed and will be addressed using a distributed testing framework. Agree about it being important for the JIT part to be modular, and I wouldn't want to impact portability of the non-JIT core because of too deep JIT changes.

Anyway, feel free to post feedback if Duktape is used on your courses - I've done some lecturing myself earlier on so I have a soft spot for that, and it'd be great if Duktape could be helpful in that setting :)

svaarala avatar Dec 03 '15 12:12 svaarala

Bit of an old thread, but I thought this might be of interest. I've been experimenting with compiling Duktape bytecode into C. The code I have created is very primitive and limited, but it does illustrate that it is possible. Note that what I have done so far is not a general-purpose method, as it makes a bunch of assumptions about the input code (it must be a single function, input params must be ints or typed arrays, all arithmetic is on integers, etc.).

Code is here: https://github.com/cosinusoidally/mishmashvm/blob/master/tests/jsmpeg/YCbCrToRGBA_bc_version.js

This code translates the colourspace conversion code from jsmpeg from JavaScript to C. When I said it was primitive and limited, I meant it: it is capable of translating YCbCrToRGBA.js but not much else.

To run it you need to check out the main project https://github.com/cosinusoidally/mishmashvm and follow the README to set up the system. The system normally runs on top of a stock Mozilla Spidermonkey binary, but it is also capable of building a Duktape based version of itself.

Once you are set up, create a test.js file in the directory above the main mishmashvm directory, containing the following:

load("mishmashvm.js");
delete Duktape; // this prevents the manually ported C version of YCbCrToRGBA from being used (this line is only needed when running under Duktape)
use_bc=true; // this activates the JIT that takes Duktape bytecode as its input
use_c=true; // this controls whether the JIT generates JS or C as its output
perf=true; // this prints perf timing
test(12); // when mishmashvm starts it presents a menu, test(12) is the jsmpeg test

then run the following from the main mishmashvm directory:

js ../test.js

If it errors out asking for a video file then grab the video and place it in the correct directory as instructed.

The pipeline is:

  • JIT compile a copy of Duktape from C
  • run duk_compile_string_filename on YCbCrToRGBA.js to obtain the Duktape bytecode
  • parse the bytecode in JS
  • break down the bytecode into basic blocks
  • convert the basic blocks into C functions
  • JIT compile the generated C code to machine code
  • call the machine code via js-ctypes (I also added a very crude js-ctypes FFI implementation to Duktape, so this will also work under Duktape)

Performance of the generated code is good enough for realtime video playback, but slower than manually ported C code.

Generated code looks a bit like this:

int f119(unsigned int *regs){
int r;
int ro;
int cv;
// ip: 119
// 0,1,0,DUK_OP_ENDLABEL
// endlabel: 1

// branch: undefined

// ip: 120
// 10,8,8,DUK_OP_ADD_RR
regs[8]=regs[8]+regs[10];
// branch: undefined

// ip: 121
// 10,9,9,DUK_OP_ADD_RR
regs[9]=regs[9]+regs[10];
// branch: undefined

// ip: 122
// 15,13,13,DUK_OP_ADD_RR
regs[13]=regs[13]+regs[15];
// branch: undefined

// ip: 123
// 15,14,14,DUK_OP_ADD_RR
regs[14]=regs[14]+regs[15];
// branch: undefined

// ip: 124
// 12,11,11,DUK_OP_ADD_RR
regs[11]=regs[11]+regs[12];
// branch: undefined

return 1;

}

Above is an example of a single basic block. Example of the fully translated function is here: https://gist.github.com/cosinusoidally/c6443f6ccfcdc347383a1a3c2e3f9faf

cosinusoidally avatar Jul 31 '22 13:07 cosinusoidally