tool-conventions Text format for relocation information

@sbc100 and I were just talking about how we might represent some of the relocation information in the WebAssembly text format.

Nothing fully formed yet, but some ideas:

It would be useful to be able to name data segments, then use these in functions, e.g.:

(func
  (i32.load (i32.const $foo)) ...)  ;; This would remap to 10
(data $foo (i32.const 10) "...")

This is a nice feature for the text format, but probably not required for relocations since we also need to be able to define imported memory offset symbols that can only be resolved at link time.
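As a rough illustration of the "remap to 10" idea above, here is a minimal Python sketch of how a tool might substitute named data segment offsets into `(i32.const $name)` uses. The function name `resolve_segment_names` and the regular expressions are assumptions made for this example; this is not how any existing tool parses the text format, and it only handles segments whose offset is a plain decimal `i32.const`.

```python
import re

# Hypothetical helper: matches (data $name (i32.const N) ...) declarations.
_DATA_DECL = re.compile(r"\(data\s+(\$[\w.]+)\s+\(i32\.const\s+(\d+)\)")
# Matches uses such as (i32.const $name).
_CONST_USE = re.compile(r"\(i32\.const\s+(\$[\w.]+)\)")


def resolve_segment_names(source: str) -> str:
    """Replace (i32.const $name) with the offset of the data segment $name."""
    offsets = {m.group(1): m.group(2) for m in _DATA_DECL.finditer(source)}

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in offsets:
            return f"(i32.const {offsets[name]})"
        # Unknown names are left alone; they might be resolved at link time.
        return match.group(0)

    return _CONST_USE.sub(substitute, source)
```

Names that have no matching segment are deliberately left untouched, which mirrors the point above: imported symbols cannot be resolved this way and still need a relocation.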

Since memory offset symbols are defined as globals, perhaps we should have some way to annotate these to specify a relocation. The current documentation says that these will be retrieved with get_global, but @sbc100 says that this is out-of-date. Instead, an i32.const 0 (with a 5-byte immediate) is written, which will be replaced later by the linker. Perhaps this could be written in the text format using a new sigil for relocations, @:

(import "some" "symbol" (global @extsym i32))
(global @intsym i32 (i32.const 0)) ;; 0 is unused, will be resolved by the linker

(func
  (i32.load offset=@intsym (i32.const 0))  ;; uses R_WEBASSEMBLY_MEMORY_ADDR_LEB
  (i32.load (i32.const @intsym))  ;; uses R_WEBASSEMBLY_MEMORY_ADDR_SLEB
  (i32.load (i32.const @extsym))  ;; same
  ...)

;; Data segments now can have names, but only for relocations.
(data @intsym (i32.const 10) ...)  

;; We need a way to store the address of a global in memory as well.
(data (i32.const 100) @extsym)  ;; uses R_WEBASSEMBLY_MEMORY_ADDR_I32

Keeping the global definition is nice because that is what will be generated in the binary format. It's kinda weird to have to define the global and use it later for the segment, though (and has issues with the name being used twice as well). Thoughts on improvements?
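To make the "5-byte immediate" concrete, here is a minimal Python sketch of how a linker might patch the three memory-address relocation kinds mentioned above. It assumes that the LEB and SLEB relocations target a LEB128 value padded to exactly 5 bytes and that the I32 relocation targets a 4-byte little-endian word. The function names are made up for this illustration and do not come from any real linker.

```python
import struct


def _padded_leb128(value: int, width: int) -> bytes:
    """Emit exactly `width` LEB128 bytes; works for signed and unsigned values."""
    out = bytearray()
    for i in range(width):
        byte = value & 0x7F
        value >>= 7  # Python's shift is arithmetic, so negatives sign-extend
        if i < width - 1:
            byte |= 0x80  # continuation bit on every byte except the last
        out.append(byte)
    return bytes(out)


def encode_padded_uleb128(value: int, width: int = 5) -> bytes:
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in an unsigned 32-bit integer")
    return _padded_leb128(value, width)


def encode_padded_sleb128(value: int, width: int = 5) -> bytes:
    if not -(2**31) <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit integer")
    return _padded_leb128(value, width)


def apply_memory_addr_leb(code: bytearray, offset: int, address: int) -> None:
    """Patch a padded unsigned immediate, e.g. the offset= field of a load."""
    code[offset:offset + 5] = encode_padded_uleb128(address)


def apply_memory_addr_sleb(code: bytearray, offset: int, address: int) -> None:
    """Patch a padded signed immediate, e.g. the operand of i32.const."""
    code[offset:offset + 5] = encode_padded_sleb128(address)


def apply_memory_addr_i32(data: bytearray, offset: int, address: int) -> None:
    """Patch a 4-byte little-endian address stored inside a data segment."""
    data[offset:offset + 4] = struct.pack("<I", address)
```

Because the placeholder always occupies 5 bytes, the linker can overwrite it in place without shifting any of the surrounding code, which is the reason for writing the padded form in the first place.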

Aug 31 '17 19:08 binji

@sbc100 brought up that clang actually generates the known memory address for the object file:

(global @intsym i32 (i32.const 10))  ;; Address matches location below.

(data (i32.const 10) "\01\02\03\04")  ;; Some C global

For convenience, it may be nicer to allow the text format to generate this automatically; but perhaps it's too magical?

(data @intsym (i32.const 10) "\01\02\03\04")

;; Generates a binary file with this global automatically
;; (global @intsym i32 (i32.const 10))

Aug 31 '17 19:08 binji

I think we should think of it in terms of what you'd want when writing an assembly file. Basically, you never want to have to write an integer literal/numeral. In traditional assembly you can use a name instead of a number for basically anything that has an address: a label (code location) or a global (data location). Wasm has a strong separation between these things. We have the $-prefix notation to refer to most constructs that are first-class in wasm, and for the most part these are things which do not have addresses in the C sense (many of which would correspond to labels in an assembler). I like the idea of having @-prefixed names for things which have or refer to linear memory addresses. I like what you have in the function example. Naming segments (i.e. your @intsym example) also makes sense as you have it. For segment initializers, why not (data (i32.const @extsym))?

Of course as you noted things get a lot more interesting with globals. I still think it would be nice to be able to declare a global variable with no numbers. Naming a global is fine but I don't like mimicking clang like your first example because you have to put something in the global initializer. You could do something like

(global i32 (i32.const @intsym))  ;; Address matches location below.
(data (i32.const @intsym) "\01\02\03\04")  ;; Some C global

But that's awkward because the @intsym in the global is really like a declaration but it looks like a use. We can name globals already, right? We could use the $ notation for that, and @ would be a use, which when it matches a named global, would mean it gets that global's value?

(global $intsym i32 (i32.const @intsym))  ;; 
(data (i32.const @intsym) "\01\02\03\04")  ;;

That seems better but still a little weird because of the repetition. OTOH maybe we just need something more like an assembler directive. A .byte directive both allocates space and lets you initialize it. Combine with a label and that's all you need. Right? But of course it's not really because there are lots of equivalents (e.g. .zero) and different ways to specify that data (.word). Maybe we want some of those too, so we really want some macro-like layer anyway.

(%.word @intsym 0xfeedcafe)
(%.string @.mystr "hello, world!")
(%.align 4 (%.byte "\01"))  ;; I don't need a label for this, it just has to be in the data section somewhere?

Maybe a directive like this would expand to one of the above ideas. (Thinking about it more though, I still think we want to avoid having to put that "10" in there even in the macro-expanded version).
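As a sketch of what "expanding" such directives could mean, the Python below turns a word or a string into the byte-string form that a data segment already accepts. Everything here is hypothetical: the helper names are invented, a word is assumed to be 32 bits stored little-endian (matching wasm's memory byte order), and a string is assumed to get a trailing NUL, as GNU as does for .string.

```python
import struct


def word(value: int) -> bytes:
    """Bytes for a hypothetical %.word directive: one 32-bit little-endian value."""
    return struct.pack("<I", value)


def string(text: str) -> bytes:
    """Bytes for a hypothetical %.string directive: UTF-8 text plus a NUL terminator."""
    return text.encode("utf-8") + b"\x00"


def to_wat_string(data: bytes) -> str:
    """Render bytes as a text-format string literal using \\hh escapes only."""
    return '"' + "".join(f"\\{b:02x}" for b in data) + '"'


def expand_to_data(address: int, data: bytes) -> str:
    """Produce a plain (data ...) segment at an address chosen by the caller."""
    return f"(data (i32.const {address}) {to_wat_string(data)})"
```

For example, `expand_to_data(10, word(0xfeedcafe))` yields a segment whose string is `"\fe\ca\ed\fe"`. Note that the caller still has to supply the address, which is exactly the "10" the comment above would like to avoid.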

Aug 31 '17 20:08 dschuff

Well, my goal with this is to at least be able to represent relocations in the text format. Whether they are convenient is a separate issue (one which we've basically punted on for the text format in general). So I don't think I care too much that you have to specify the address of a C global manually (e.g. (data (i32.const 10) ...)). So the assembler-like directives are cool, but seem like they might be more than we need here.

AFAICT, we already can generate the appropriate relocations for all cases except for memory addresses, so we only need to add support for those. Maybe the @ thing isn't actually necessary, and $ can be used everywhere instead? I thought it was necessary, but I think the uses are unambiguous. Well, (i32.const $foo) could be any index, but the only useful values are a memory address or table index.

This still means you have to put something in the global initializer though. One alternative is to use the shortened syntax like we have for (table anyfunc (elem ...)) and (memory (data ...)):

(data $dataName (global i32 (i32.const 10)) "hello world")
(func (i32.load (i32.const $dataName)) ...)
(func (i32.load offset=$dataName (i32.const 0)) ...)

This feels a bit weird, but also kinda makes sense, I think. Not sure what happens if we would allow the global to be named something else:

(data $dataName (global $globalName i32 (i32.const 10)) "hello world")
(func (get_global $globalName) ...)  ;; I guess this works?

Aug 31 '17 22:08 binji

I should say my other goal is that the compiler could use the text format instead of the binary format, and have the result be equivalent and correct (and even better, identical?). So it doesn't have to be convenient necessarily, but I think it does need to be a bit less minimal than you are maybe wanting here? For example, if that were the only goal we wouldn't need .word (because we have the byte-string representation) but we would need e.g. .align. Come to think of it, if we had that, we wouldn't need that "10" in our examples at all; it could just always be 0. The assembler can do the first-pass layout of the memory space to generate an object. In that case we don't need to specify alignment. Or perhaps another way of saying that is that clang can do the layout by virtue of the fact that it supports direct object emission, but compilers that don't typically also can't do the layout; they rely on an assembler and its various kinds of support for labels and data specifiers. It would be nice if we didn't have to require that of compilers, but we could think of it as an additional incremental feature.
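To illustrate what "first-pass layout" by the assembler might involve, here is a small Python sketch that assigns an address to each named piece of data while honoring its alignment. The `Item` type and the `layout` function are assumptions made for this example, not part of any existing assembler.

```python
from typing import Dict, List, NamedTuple, Tuple


class Item(NamedTuple):
    name: str   # symbol name, e.g. "@intsym" (hypothetical notation)
    size: int   # size in bytes
    align: int  # required alignment in bytes; must be a power of two


def align_up(offset: int, align: int) -> int:
    """Round offset up to the next multiple of align."""
    if align <= 0 or align & (align - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + align - 1) & ~(align - 1)


def layout(items: List[Item], base: int = 0) -> Tuple[Dict[str, int], int]:
    """Assign addresses in declaration order; return (addresses, end offset)."""
    addresses: Dict[str, int] = {}
    offset = base
    for item in items:
        offset = align_up(offset, item.align)
        addresses[item.name] = offset
        offset += item.size
    return addresses, offset
```

With something like this in the assembler, the compiler would only emit names, sizes and alignments, and every address in the object file could start out relative to 0.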

That shortened syntax is convenient, but still seems pretty nonobvious. It looks to me like it should be get_global instead, but of course that means something else. In the wasm text format can you just emit a global declaration from any module-level context, or does it have to be in section order like the binary? If we're not aiming for human writers or convenience it wouldn't be that bad for a compiler to just spit out 2 lines (especially if it could be just anywhere). I kind of do like having the @ syntax for the names though, because it highlights that while most of the other names get mapped to indices in the various wasm-defined index spaces, these names do not.

Aug 31 '17 22:08 dschuff

> I should say my other goal is that the compiler could use the text format instead of the binary format, and have the result be equivalent and correct

Ah, I see. That seems a worthwhile goal. This question seems to come up a lot: which features can be added to the text format for the convenience of the assembler? In addition, should these features be part of the official format or just extensions? Maybe we should bring this up in one of the CG meetings (and bring popcorn).

> That shortened syntax is convenient, but still seems pretty nonobvious.

Yeah, I suppose if we're going to add syntactic sugar we should make it sweeter. :-)

> In the wasm text format can you just emit a global declaration from any module-level context, or does it have to be in section order like the binary?

Nearly everything in the text format can be specified in any order. The only constraint is that imports must be before all non-imports.

> I kind of do like having the @ syntax for the names though

Yeah, I see what you mean. I'm just a little wary of introducing new syntax if we can help it.

Aug 31 '17 23:08 binji

> In addition, should these features be part of the official format or just extensions? Maybe we should bring this up in one of the CG meetings (and bring popcorn).

Not a bad idea, although I think that hashing out our use cases here before doing that is a good idea anyway. Having good concrete use cases is also a good way to avoid bikeshedding (or at the very least, you don't have to care about bikeshedding, since any bikeshedders have to keep your use case working in their alternate proposals :grin:)

> Nearly everything in the text format can be specified in any order. The only constraint is that imports must be before all non-imports

Even better than that would be the ability to switch sections anywhere, even in the middle of a function...

So taking another swing then, I can see the capabilities at a few possible layers.

1) Minimal assembler, requires a smart compiler. This is essentially what you proposed initially here. It's just enough for the assembler to generate the relocations required for linking. It requires the compiler to lay its own data section out. Therefore the format and assembler don't need to support alignment directives or any new ways to specify the content of the data section (and perhaps no need to have the concept of more than 1 data section?). The format probably doesn't need any syntactic sugar for things like reducing the duplication or automatically declaring globals (and maybe doesn't want that, since it's feeling pretty low-level). The format is probably pretty much just like the current one, with some extra names where we now have indices.

2) Assembler lays out the data section. This means the compiler can be simpler, and the assembler needs an additional high-level capability. The compiler emits directives which specify the alignment of data segments (and probably other layout-related directives such as .org) but wouldn't necessarily need new ways to specify its initialized content.

2a) Like 2, but more human-friendly. No new fundamental capabilities, but extra directives, such as alternate ways of specifying data (hex/decimal, word, fill, etc). This kind of thing might somehow be implemented as an extension or macro-type layer if we do it right?

3) Even more linear. Support switching sections anywhere in a function. This lets a compiler be even more stateless.

Other use cases:

Symbol metadata, e.g. visibility. Some answer to this is probably necessary even at layer 1.
Debug info. Or maybe equivalently, a way to address individual instructions. This is fairly straightforward at layer 3 (e.g. an inline .loc that marks the source location of the next few instructions, or a label preceding an instruction which can be the target of a relocation) but we might need extra syntax if we want it in the lower layers.
Multiple data and/or text sections. A way to segregate data through the linking process, even if it all ends up in linear memory eventually (or alternately, it doesn't but the assembler and/or linker still lays it out and relocates it for you, leaving you with a linked custom section that you can do whatever you like with).
Expressions on symbols, the ever-entertaining 'dot' symbol, etc.

Sep 01 '17 00:09 dschuff

Yeah, I think it would be nice to get to a point where we can do all the things that you'd expect from a "real" assembler. (I'm surprised you didn't mention things like macros and binary includes.) I like the idea that we could layer functionality on top, starting with making everything expressible, then making it more convenient to actually use.

Speaking of macros, it might be nicer to just have a sufficiently powerful macro/metaprogramming language for this. Then again, maybe this is overthinking it.

Anyway, I'm glad that you agree that #1 is a good place to start. :-)

Sep 01 '17 18:09 binji

What's wrong with the language currently output by clang -S ...? It currently doesn't have a parser implemented, so it doesn't work right now. But it looks good to my (untrained) eye.

Sep 27 '17 05:09 ElvishJerricco

Really the only thing wrong with it is that it's not much like the official text format. The assembler format needs to have some extra features as discussed here, but ideally the official format would just be a subset of the assembler format, if for no other reason than that fewer different representations are better than more. If we decide for some reason that we can't or don't want to do that, then a representation like the existing one would be good, because it looks a lot like assembly languages on other architectures. In fact I'm willing to be convinced that it's better; it just has to be enough better to make it worth having yet another representation style.

Sep 27 '17 15:09 dschuff

Gotcha. I can't provide a very meaningful argument, but I will say that GHC expects the assembly output of LLVM to be fairly regular across languages, since it needs to do some awful hacks to the assembly. The existing language is fine for GHC, but a new one would necessitate some changes. I just wonder if GHC is totally unique in this, or if anyone else has come to rely on the common shape of the assembly languages used by LLVM. At the very least, I could imagine someone in the future being frustrated by the fact that WebAssembly is the only major backend with a 100% custom assembly language.

FWIW, here is a good overview on a few of the problems GHC currently faces with LLVM and what is being done to fix them. Due to a massive amount of technical debt in GHC, several of the unfortunate design decisions are unlikely to be fixed any time soon.

Sep 27 '17 15:09 ElvishJerricco