design
design copied to clipboard
[Text Format] Allow the data in data segments to be written in a vector of u8 and s8 numbers.
Hello,
In the current specification, when we initialize data segments, we are only allowed to use a vector of strings.
![Screen Shot 2020-05-26 at 8 00 30 PM](https://user-images.githubusercontent.com/6957755/82929015-b25e0c80-9f8b-11ea-99c8-a47b3c3adf53.png)
For example:
(data (offset (i32.const 0)) "a" "b" "c")
or
(data (offset (i32.const 0)) "abc")
To save some arbitrary numbers, we can use backslashes and character hex numbers.
(data (offset (i32.const 0)) "\01\02\03\04\05" )
But this approach feels a bit hacky. Wouldn't it better if we are also allowed to put a vector of numbers that can be written in numbers directly?
For example:
;; in decimal
(data (offset (i32.const 0)) 1 2 3 4 5 254 255)
;; in hex
(data (offset (i32.const 0)) 0x01 0x02 0x03 0x04 0x05 0xFE 0xFF)
And I think this addition still conforms with the core syntax definition, which defines the data part as a vector of bytes:
![Screen Shot 2020-05-26 at 8 01 06 PM](https://user-images.githubusercontent.com/6957755/82929241-010ba680-9f8c-11ea-9c53-2c4984ebd6b8.png)
I agree that the current format is pretty limited. That said, I think if we change this we'll want to have something more comprehensive. For example, see this simple raytracer I made. These values are f32
constants, but it's really hard to tell.
We'll probably want syntax for this (maybe (f32.const x)
? Though this raises questions how to handle u8/u16, since there's no u16.const
instruction...)
Interesting demo :)
If we want something more advanced than an array of bytes, maybe we need to introduce special keywords for them? For example:
data ::= '(' 'data' memidx '(' 'offset' expr ')' datacontent* ')'
datacontent ::= string
| '(' 'data_u8' u8 ')'
| '(' 'data_u16' u16 ')'
| '(' 'data_u32' u32 ')'
...
| '(' 'data_f32' f32 ')'
| '(' 'data_f64' f64 ')'
Here's the usage example:
(data (i32.const 0)
"hello"
(data_f32 3.14159265359)
(data_u32 0xABCD)
)
Because they are converted to data in the segment data during the wat2wasm compilation, we don't need to add any additional valtype.const
instructions (e.g. u16.const
, etc).
And the additional grammar above should be backward compatible, as it still accepts strings.
Because they are converted to data in the segment data during the wat2wasm compilation, we don't need to add any additional valtype.const instructions (e.g. u16.const, etc).
Good point. In that case it could use the same format anyway (which is also used in other places too, e.g. assertions), as if we had an u8.const
instruction.
This is a pretty simple change, but still requires going through the proposal process for standardization. Would you want to champion it?
Sure, I would be glad to do that. What are the steps that I should take?
Next steps would be to bring this to a future wasm community group meeting to see whether the group is interested in pursuing it. Are you a community group member? If not, you can join here: https://www.w3.org/community/webassembly/
Suggestion:
datacontent ::= string
| '(' 'data_u8' u8* ')'
| '(' 'data_u16' u16* ')'
| '(' 'data_u32' u32* ')'
...
| '(' 'data_f32' f32* ')'
| '(' 'data_f64' f64* ')'
So for example if we wanted to have a vector of 3 floats, instead of having to say (data_f32 1.0) (data_f32 2.0) (data_f32 3.0)
, we could say (data_f32 1.0 2.0 3.0)
Could take that idea even farther and allow for a mixable data stream. For example, variable width vectors that are prefixed by their length: (data_stream i32 2 f32 1.0 2.0 i32 3 f32 3.0 4.0 5.0)
Not sure how useful that would actually be, just spitballing at this point.
@jgravelle-google's suggestion seems to be useful. For example, this copied code from @binji's ray tracer:
(data (i32.const 0)
"\00\00\00\00" ;; ground.x
"\00\00\c8\42" ;; ground.y
"\00\00\00\00" ;; ground.z
"\00\00\c2\42" ;; ground.r
"\3b\df\6f\3f" ;; ground.R
"\04\56\8e\3e" ;; ground.G
"\91\ed\bc\3e" ;; ground.B
)
can be re-written as
(data (offset (i32.const 0))
(data_f32 0 100.0 0 97.0) ;; ground.{x,y,z,r}
(data_f32
0.936999976635 ;; ground.R
0.277999997139 ;; ground.G
0.368999987841 ;; ground.B
)
)
Regarding @binji's previous comment:
Good point. In that case it could use the same format anyway (which is also used in other places too, e.g. assertions), as if we had an u8.const instruction.
Yeah, maybe it's better to use those formats instead of introducing the new ones. So I'm thinking the possible updated grammar would be this:
data ::= '(' 'data' memidx '(' 'offset' e:expr ')' datacontent* ')'
datacontent ::= string
| '(' 'u8.const' u8+ ')'
| '(' 'u16.const' u16+ ')'
| '(' 'u32.const' u32+ ')'
| '(' 'u64.const' u64+ ')'
| '(' 's8.const' s8+ ')'
| '(' 's16.const' s16+ ')'
| '(' 's32.const' s32+ ')'
| '(' 's64.const' s64+ ')'
| '(' 'f32.const' f32+ ')'
| '(' 'f64.const' f64+ ')'
Nice, I like this! Although we can simplify it further, since we already require that i32.const
etc. handle both signed and unsigned values. So we could have just:
datacontent ::= string
| '(' 'i8.const' i8+ ')'
| '(' 'i16.const' i16+ ')'
| '(' 'i32.const' i32+ ')'
| '(' 'i64.const' i64+ ')'
| '(' 'f32.const' f32+ ')'
| '(' 'f64.const' f64+ ')'
@binji Great, I hope it can make its way to the text format spec. I've just joined the wasm community group with id: echamudi.
@echamudi Thanks! If you can attend one of the next CG meetings (the next is June 9th) to present, that would be best. You can open a PR to add this as an agenda item: https://github.com/WebAssembly/meetings/tree/master/2020
Can we simplify that further to
dataval ::= string
| '(' 'i8' i8+ ')'
| '(' 'i16' i16+ ')'
| '(' 'i32' i32+ ')'
| '(' 'i64' i64+ ')'
| '(' 'f32' f32+ ')'
| '(' 'f64' f64+ ')'
This mirrors what we did for SIMD values, and it avoids mistaking these value encodings for instructions. Also, I would tend to replace +
with *
, to be consistent with other n-ary constructs in the text format.
Looks fine with me. Also, I found out that the previous syntax (especially i32.const
) can be confusing when we use the offset abbreviation.
(data (i32.const 10) (i32.const 20)) ;; same keyword, but 1st one is offset, 2nd one is data.
So, using @rossberg's suggested format might remove the problem.
(data (i32.const 10) (i32 20))
Yeah, in fact, t.const might even create an ambiguity for passive segments.
Ah, good points. This is looking very nice now!
Hey, I decided to fork and play with the wat2wasm code last weekend 😃. So, here's the sneak peek demo of this proposal: https://wasmprop-numerical-data.netlify.app/wat2wasm/ .
And here are the code changes: code diff