design Request for opcode: dup

Hi,

I would like to request a webassembly opcode: dup. The semantics would be that the top of stack is duplicated.

Regards, Windel

Aug 16 '20 13:08 windelbouwman

Hi @windelbouwman, I took the liberty of transferring this issue to the design repo, where spec proposals usually start. You can see the full process all changes would need to go through to become standardized here.

Can you explain the motivation for this proposal a bit more? dup can currently be emulated by a local.tee instruction followed by a local.get instruction, so as far as I can see this proposal would only be a code size improvement. Do you have an estimate on what the code size savings could be if we introduced a dup instruction?

Aug 16 '20 19:08 tlively

Not only the code size is improved, but also the readability of the code, the speed of the direct interpretation, also there will be no need of a local variable (that further obscures the simple operation of a duplication). This dup instruction is discussed in few other issues, maybe someone can link them. It is really interesting that instructions like clz ctz found their place into the basic WASM byte code instruction set, but not dup and i32.not_equal_z.

Aug 17 '20 07:08 verbessern

Hi!

Thanks for moving this request. I understand the request for a concrete use case for this instruction.

The specific use case where this would be handy, is when implementing a for loop. As a gimmick, I tried to make a python -> wasm compiler, and when implementing the for loop, it would be handy to leave the loop index value on the stack. Now I solved this by creating a new local per for loop, but this results potentially in sloppy code, when for loops are not nested. Some extra logic is required to generate the least amount of locals for the for loops.

Exact spot of the use case is here:

https://github.com/windelbouwman/corepython/blob/master/corepython/src/compile.rs#L205

This is one use case, but I can imagine there are more use cases.

Notice also the JVM has a dup instruction: https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.dup

It might be worth noting that the JVM machine and webassembly machine are pretty similar, so there might be more ideas to copy from the JVM.

Hope this clarifies the request a bit.

Regards, Windel

Aug 20 '20 18:08 windelbouwman

Do you have an estimate on what the code size savings could be if we introduced a dup instruction?

No, I do not. Interesting question though, how would one create such an estimation?

I can see compilers exploiting this instruction, and spec implementations implementing this pretty straight forward.

Aug 20 '20 18:08 windelbouwman

The function-references proposal introduces a let construct that provides the ability to implement dup. In essence, it allows you to create new locals from stack values. Some have also suggested a pick instruction, (where dup is equivalent to pick 0). This doesn't give you quite what you want though, since you still may need to do some kind of rotate or swap to get the appropriate stack layout.

Aug 20 '20 18:08 binji

I remember discussing with @rossberg a few years ago that the typing of syntactically dead code would become much more interesting if a dup instruction of type [t] -> [t,t] were to be introduced, because one would have to ensure that multiple uses of the same "unknown type" stay consistent with each other.

e.g. the code fragment

unreachable
i32.clz
drop
i64.clz

is well-typed, but the code fragment

unreachable
dup
i32.clz
drop
i64.clz

would be ill-typed. But this requires some more sophisticated book-keeping than has been necessary up to this point,

The smallest solution might be to require that dup is explicitly type-annotated. At a quick glance, it looks like Java avoided this problem by being more restrictive in its validation of dead code, in a way that probably wouldn't work for Wasm (https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.return).

Aug 20 '20 20:08 conrad-watt

The function-references proposal introduces a let construct that provides the ability to implement dup. In essence, it allows you to create new locals from stack values. Some have also suggested a pick instruction, (where dup is equivalent to pick 0). This doesn't give you quite what you want though, since you still may need to do some kind of rotate or swap to get the appropriate stack layout.

You are right, a swap operation would also be very convenient. Can I upgrade my request to include both the swap and the dup operations? swap could arguably we called rotate, as is done in the python bytecode, but that might be confusing with the bit rotation operations.

Aug 21 '20 06:08 windelbouwman

Still my primary use case for this would be implementation of a for loop without having to generate locals.

Aug 21 '20 06:08 windelbouwman

I think that a combination of the validation logic that applies to local.tee + local.get should be the base for creating the validation logic of dup, by making the local implicit.

Aug 21 '20 07:08 verbessern

Similar discussion in the interface-types proposal: https://github.com/WebAssembly/interface-types/issues/118

Aug 21 '20 11:08 mikevoronov

@verbessern local.[tee/get] only avoid the issue I outlined above because these instructions explicitly refer to a type-annotated local. This is analogous to requiring an explicit type annotation for dup.

Another solution, avoiding the explicit type annotation, would be to weaken Wasm's validation so that syntactically dead code is no longer validated. i.e. introduce a ⊥ stack type and the typing rules

C ⊢ unreachable : t* -> ⊥
<similar for return, br, br_table>

C ⊢ e* : ⊥ -> t*

I've been told that there were extensive discussions over this during Wasm's initial design.

FWIW, the same issue applies to pick, but not swap.

Aug 21 '20 11:08 conrad-watt

Ignoring any dead code would be a wonderful simplification to make, yes.

Failing that, add i32.dup and friends? :)

That wastes opcodes, but certainly it being polymorphic wasn't useful anyway.

Aug 21 '20 21:08 aardappel

Ignoring any dead code would be a wonderful simplification to make, yes.

Failing that, add i32.dup and friends? :)

That wastes opcodes, but certainly it being polymorphic wasn't useful anyway.

It might be most natural to have it be one opcode with a static parameter, otherwise it wouldn't scale to reference/GC types. This slightly reduces the impact of code size savings though.

Also, in the scheme I gave above, dead code would still have to be parsed for "malformedness"/encoding errors. The spec's distinction between parse/validation-time errors has cropped up as a gotcha in tests before and this would become another example.

Flinging out 4 (really 3) ways forward:

Pursue dup with a static type parameter akin to a local declaration.
Pursue polymorphic dup, relaxing dead code validation (maybe arouses strong opinions?).
Don't pursue dup.
Pursue polymorphic dup as-is, requiring validators to handle some sort of ad-hoc parameterised type? (strongly against)

(also, all of the above but for pick; as @binji pointed out, it generalises dup so it might be preferable)

Aug 21 '20 22:08 conrad-watt

One thing I didn't initially think about. For pick, option (1) is much less ideal, since the following would have to be a type error:

unreachable
i32.pick 900
drop
i64.pick 900

In an archetypal single-pass validator like the OCaml reference implementation, typing the i32.pick 900 line would require creating ~900 elements at the head of the current stack type in one step. Having to think about typing dead code to this extent seems like a bad smell.

Aug 21 '20 22:08 conrad-watt

I'm not convinced that the losing of the stack polymorphism is a good idea. I think its a nice thing to have.

Aug 22 '20 07:08 verbessern

What's the advantage of stronger dead code validation? It's a corner case that shouldn't come up very often in actual wasm uses, right?

Aug 22 '20 18:08 taralx

@taralx I believe the original motivation for the current approach to typing dead code was to ensure that a well-typed sequence of instructions can be arbitrarily split, and the resulting contiguous subsequences are guaranteed to be well-typed. I don't know if any tooling is relying on this.

@rossberg might have a first-hand memory of the decision-making process.

Aug 22 '20 19:08 conrad-watt

Some browser vendors raised concerns at the time against allowing unvalidated code. I believe there were fears that it could become part of a potential attack vector, which is a reasonable concern. Another option that was discussed was disallowing dead code altogether. The rationale for the current design is summarised here.

The simplest solution would be a dup with a type annotation or a pick instruction with a stack type annotation of the appropriate depth. However, let is more general and can express all stack manipulation operators like pick, swap, or rotate (which you'll need otherwise).

Aug 24 '20 07:08 rossberg

What about the use case of wanting to square or cube a number that is the result of the previous operation? One example is computing the distance between two points. It seems odd to have to put the result of x1 - x2 in a local variable and then get it twice in order to computing the square.

Feb 20 '21 13:02 mvolkmann

Since WebAssembly is meant to be a compilation target rather than a convenient language to write by hand, it’s generally not a problem that certain code patterns require the use of locals. Using stack instructions instead of locals would not affect the performance of optimizing engines. The primary reasons to potentially add stack duplications instructions are to handle values that could not reasonably be stored in normal locals, such as non-nullable reference types (depending on your point of view).

Feb 21 '21 10:02 tlively

I understand that WebAssembly is not typically hand written, but what is the down-side in making that practice easier/nicer by adding an instruction? Beyond dup there could be a sqr instruction to square a number since it is a relatively common operation.

Feb 21 '21 12:02 mvolkmann

Besides the typing issue discussed above, there's probably no downside in the sense you're thinking of. It takes a surprising amount of work and time to spec new instructions and get them implemented in every tool and engine out there, though, so generally only changes with significant benefits get all the way through the process. Unfortunately that means that there are a lot of "nice to have" proposals, even tiny ones like adding a single dup instruction, that don't make the cut. I'm not saying we'll never add dup, but if we do it will because it solves an important problem so lots of folks agree it's important to add and will be motivated to implement and maintain it throughout the ecosystem.

This is one of the costs of standards-based work; if Wasm were controlled by a single party, it would be easy to add a single instruction like dup. Since it's not, you first have to get a lot of different people with different priorities and opinions to agree that adding dup is both a good idea and worth their time and effort. Because of this extra consensus-building work, the community can have more confidence in the robustness and benefits of the proposals that do make it through the process.

Feb 22 '21 08:02 tlively

Thanks for explaining that Thomas!

Feb 22 '21 13:02 mvolkmann

There may be a hidden benefit to not supporting a dup instruction: it makes reasoning with linear types a great deal easier. Of course, linear types is also a thing for the potential future.

Feb 22 '21 16:02 fgmccabe

what @tlively said should be the intro for "So you want to propose a feature?" faq :P

Feb 22 '21 17:02 aardappel

Coming back to this, it's unclear to me whether the general consensus was that dup and swap opcodes wouldn't be possible or there just wasn't an appetite for them. I'd love to know whether either of them are at all possible in the future.

Actively shipping a JIT right now, having these opcodes would make it possible to generate smaller and more efficient code, since right now it seems like the use of temporary locals to duplicate a value can in many cases cause the value to unnecessarily get stored to the wasm shadow stack in memory. dup and swap instructions would (by my understanding) boil down to the equivalent of register renames/references and not require any actual generated code to implement in many architectures compared to the loads/stores involved in teeing and getting locals.

The need to have temporary locals to house values the user intends to duplicate or swap also makes the stack frame of every function larger, and I feel like that is another good justification for having opcodes that allow skipping the temporary locals. In an SSA form, I think each time that temporary local is overwritten for a dup or swap, that would introduce another SSA variable and complicate things, right?

To provide a couple specific examples (in pseudocode), null checks look like this:

i32.load [address of pointer in memory]
local.tee $nullcheck_ptr
i32.eqz
if
  throw
else
  local.get $nullcheck_ptr
  ... use the value ...
end

This is awkward and bloated, especially if I'm throwing around multiple nullable pointers. But one can live with it, and it's just one reusable local (that sadly is messy to optimize or analyze since it's constantly being overwritten).

Overflow/underflow checks get worse though, like if I want to convert a float to a uint16 it looks something like this:

f32.load [address of some value in memory]
local.tee $conv_f32 # note I need a unique one of these for each value type
f32.const 0
f32.flt
if
  throw
end
local.get $conv_f32
f32.const 65535
f32.fgt
if
  throw
end
local.get $conv_f32
i32.trunc_u
... use the value ...

We now have two loads from a named temp local, and every new type added to the wasm type system means at least one of these locals that needs to be dealt with separately. Once we have custom struct types or gc types etc I can imagine having dozens of these.

To go further, we have to overflow/underflow check binary operations, like multiplies or divides - which means we need two temporary locals of each type, one to cache the lhs and one to cache the rhs. It gets out of hand quickly. For the single-value case, dup tremendously simplifies the code and allows values to stay entirely in CPU registers instead of getting spilled to the shadow stack. For two-value cases like binary operations, I think you'd also need swap to work entirely on the wasm stack, but having dup would at least let you optimize out one of the two temporary locals, I think.

The validation concerns make sense and it seems like having to explicitly pass an expected type to dup and swap would not be a big problem. It would potentially remove most of the code size advantage since dup <type> and local.get <index> are about the same size, but getting rid of the extra locals is still a win, IMO.

Is this something that is too difficult to implement in engines like v8 and spidermonkey? Is there no appetite for it since most people are just generating code with llvm and it deals with all of this under the hood?

Feb 17 '23 01:02 kg

Is there no appetite for it since most people are just generating code with llvm and it deals with all of this under the hood?

If anything, that may be an argument for introducing dup, since all code would benefit from this simplification. Someone could experiment with a "collapse locals to dup" pass in Binaryen to see how many bytes it would typically save.

Feb 17 '23 01:02 aardappel

Now that let has been removed from the function-references proposal, maybe it is time consider what @rossberg suggested above: a dup or pick instruction that includes a stack type annotation.

Feb 17 '23 01:02 titzer

I assume pick would look something like i32.pick 2 to pick the 3rd (indexing starting at 0, so pick 0 is dup) item on the stack? Definitely more powerful than dup or swap alone, I don't have a good sense for whether most implementations would have more trouble implementing it since it's so powerful in comparison.

If we only get one shot at having this kind of opcode added it feels like pick would be ideal, assuming I understand it right. It would be very useful for a store-to-load forwarding pass I'm looking at doing now, in fact, by freeing me from the need to arrange the wasm stack just so in advance to match the order of future loads.

Feb 20 '23 21:02 kg

design design copied to clipboard

Request for opcode: dup

design
design copied to clipboard