design icon indicating copy to clipboard operation
design copied to clipboard

[Text Format] Allow the data in data segments to be written in a vector of u8 and s8 numbers.

Open echamudi opened this issue 4 years ago • 15 comments

Hello,

In the current specification, when we initialize data segments, we are only allowed to use a vector of strings.

Screen Shot 2020-05-26 at 8 00 30 PM

For example:

(data (offset (i32.const 0)) "a" "b" "c")

or

(data (offset (i32.const 0)) "abc")

To save some arbitrary numbers, we can use backslashes and character hex numbers.

(data (offset (i32.const 0)) "\01\02\03\04\05" )

But this approach feels a bit hacky. Wouldn't it better if we are also allowed to put a vector of numbers that can be written in numbers directly?

For example:

;; in decimal
(data (offset (i32.const 0)) 1 2 3 4 5 254 255)
;; in hex
(data (offset (i32.const 0)) 0x01 0x02 0x03 0x04 0x05 0xFE 0xFF)

And I think this addition still conforms with the core syntax definition, which defines the data part as a vector of bytes:

Screen Shot 2020-05-26 at 8 01 06 PM

echamudi avatar May 26 '20 17:05 echamudi

I agree that the current format is pretty limited. That said, I think if we change this we'll want to have something more comprehensive. For example, see this simple raytracer I made. These values are f32 constants, but it's really hard to tell.

We'll probably want syntax for this (maybe (f32.const x)? Though this raises questions how to handle u8/u16, since there's no u16.const instruction...)

binji avatar May 26 '20 19:05 binji

Interesting demo :)

If we want something more advanced than an array of bytes, maybe we need to introduce special keywords for them? For example:

data ::= '(' 'data' memidx '(' 'offset' expr ')' datacontent* ')'

datacontent ::= string
             | '(' 'data_u8' u8 ')'
             | '(' 'data_u16' u16 ')'
             | '(' 'data_u32' u32 ')'
             ...
             | '(' 'data_f32' f32 ')'
             | '(' 'data_f64' f64 ')'

Here's the usage example:

(data (i32.const 0)
  "hello"
  (data_f32 3.14159265359)
  (data_u32 0xABCD)
)

Because they are converted to data in the segment data during the wat2wasm compilation, we don't need to add any additional valtype.const instructions (e.g. u16.const, etc).

And the additional grammar above should be backward compatible, as it still accepts strings.

echamudi avatar May 26 '20 21:05 echamudi

Because they are converted to data in the segment data during the wat2wasm compilation, we don't need to add any additional valtype.const instructions (e.g. u16.const, etc).

Good point. In that case it could use the same format anyway (which is also used in other places too, e.g. assertions), as if we had an u8.const instruction.

This is a pretty simple change, but still requires going through the proposal process for standardization. Would you want to champion it?

binji avatar May 26 '20 23:05 binji

Sure, I would be glad to do that. What are the steps that I should take?

echamudi avatar May 27 '20 00:05 echamudi

Next steps would be to bring this to a future wasm community group meeting to see whether the group is interested in pursuing it. Are you a community group member? If not, you can join here: https://www.w3.org/community/webassembly/

binji avatar May 27 '20 16:05 binji

Suggestion:

datacontent ::= string
             | '(' 'data_u8' u8* ')'
             | '(' 'data_u16' u16* ')'
             | '(' 'data_u32' u32* ')'
             ...
             | '(' 'data_f32' f32* ')'
             | '(' 'data_f64' f64* ')'

So for example if we wanted to have a vector of 3 floats, instead of having to say (data_f32 1.0) (data_f32 2.0) (data_f32 3.0), we could say (data_f32 1.0 2.0 3.0)

Could take that idea even farther and allow for a mixable data stream. For example, variable width vectors that are prefixed by their length: (data_stream i32 2 f32 1.0 2.0 i32 3 f32 3.0 4.0 5.0) Not sure how useful that would actually be, just spitballing at this point.

jgravelle-google avatar May 27 '20 17:05 jgravelle-google

@jgravelle-google's suggestion seems to be useful. For example, this copied code from @binji's ray tracer:

(data (i32.const 0)
  "\00\00\00\00" ;; ground.x
  "\00\00\c8\42" ;; ground.y
  "\00\00\00\00" ;; ground.z
  "\00\00\c2\42" ;; ground.r
  "\3b\df\6f\3f" ;; ground.R
  "\04\56\8e\3e" ;; ground.G
  "\91\ed\bc\3e" ;; ground.B
)

can be re-written as

(data (offset (i32.const 0))
  (data_f32 0 100.0 0 97.0)     ;; ground.{x,y,z,r}
  (data_f32 
    0.936999976635              ;; ground.R
    0.277999997139              ;; ground.G
    0.368999987841              ;; ground.B
  )
)

Regarding @binji's previous comment:

Good point. In that case it could use the same format anyway (which is also used in other places too, e.g. assertions), as if we had an u8.const instruction.

Yeah, maybe it's better to use those formats instead of introducing the new ones. So I'm thinking the possible updated grammar would be this:

data ::= '(' 'data' memidx '(' 'offset' e:expr ')' datacontent* ')'

datacontent ::= string
            | '(' 'u8.const'  u8+ ')'
            | '(' 'u16.const' u16+ ')'
            | '(' 'u32.const' u32+ ')'
            | '(' 'u64.const' u64+ ')'

            | '(' 's8.const'  s8+ ')'
            | '(' 's16.const' s16+ ')'
            | '(' 's32.const' s32+ ')'
            | '(' 's64.const' s64+ ')'

            | '(' 'f32.const' f32+ ')'
            | '(' 'f64.const' f64+ ')'

echamudi avatar May 27 '20 18:05 echamudi

Nice, I like this! Although we can simplify it further, since we already require that i32.const etc. handle both signed and unsigned values. So we could have just:

datacontent ::= string
            | '(' 'i8.const'  i8+ ')'
            | '(' 'i16.const' i16+ ')'
            | '(' 'i32.const' i32+ ')'
            | '(' 'i64.const' i64+ ')'

            | '(' 'f32.const' f32+ ')'
            | '(' 'f64.const' f64+ ')'

binji avatar May 27 '20 18:05 binji

@binji Great, I hope it can make its way to the text format spec. I've just joined the wasm community group with id: echamudi.

echamudi avatar May 27 '20 19:05 echamudi

@echamudi Thanks! If you can attend one of the next CG meetings (the next is June 9th) to present, that would be best. You can open a PR to add this as an agenda item: https://github.com/WebAssembly/meetings/tree/master/2020

binji avatar May 27 '20 21:05 binji

Can we simplify that further to

dataval ::= string
            | '(' 'i8'  i8+ ')'
            | '(' 'i16' i16+ ')'
            | '(' 'i32' i32+ ')'
            | '(' 'i64' i64+ ')'
            | '(' 'f32' f32+ ')'
            | '(' 'f64' f64+ ')'

This mirrors what we did for SIMD values, and it avoids mistaking these value encodings for instructions. Also, I would tend to replace + with *, to be consistent with other n-ary constructs in the text format.

rossberg avatar May 28 '20 08:05 rossberg

Looks fine with me. Also, I found out that the previous syntax (especially i32.const) can be confusing when we use the offset abbreviation.

(data (i32.const 10) (i32.const 20)) ;; same keyword, but 1st one is offset, 2nd one is data.

So, using @rossberg's suggested format might remove the problem.

(data (i32.const 10) (i32 20))

echamudi avatar May 28 '20 09:05 echamudi

Yeah, in fact, t.const might even create an ambiguity for passive segments.

rossberg avatar May 28 '20 14:05 rossberg

Ah, good points. This is looking very nice now!

binji avatar May 28 '20 16:05 binji

Hey, I decided to fork and play with the wat2wasm code last weekend 😃. So, here's the sneak peek demo of this proposal: https://wasmprop-numerical-data.netlify.app/wat2wasm/ .

And here are the code changes: code diff

echamudi avatar Jun 09 '20 01:06 echamudi